We consider the asymptotic behavior of posterior distributions if
the model is misspecified. Given a prior distribution and a random
sample from a distribution $P_0$, which may not be in the support of
the prior, we show that the posterior concentrates its mass near the
points in the support of the prior that minimize the Kullback-Leibler
divergence with respect to $P_0$. An entropy condition and a prior-mass
condition determine the rate of convergence. The method is applied to
several examples, with special interest for infinite-dimensional models.
These include Gaussian mixtures, nonparametric regression and
parametric models.

1. Introduction. Of all criteria for statistical estimation, asymptotic
consistency is among the least disputed. Consistency requires that the
estimation procedure come arbitrarily close to the true, underlying
distribution, if enough observations are used. It is of a frequentist
nature, because it presumes a notion of an underlying, true distribution
for the observations. If applied to posterior distributions, it is also
considered a useful property by many Bayesians, as it could warn one away
from prior distributions with undesirable, or unexpected, consequences.
Priors which lead to undesirable posteriors have been documented, in
particular, for non- or semiparametric models (e.g., [4, 5]), in which
case it is also difficult to motivate a particular prior on purely
intuitive, subjective grounds.

In the present paper we consider the situation where the posterior
distribution cannot possibly be asymptotically consistent, because the
model, or the prior, is misspecified. From a frequentist point of view, the
relevance of studying misspecification is clear, because the assumption that
the model contains the true, underlying distribution may lack realistic
motivation in many practical situations. From an objective Bayesian point of
view, the question is of interest, because, in principle, the Bayesian
paradigm allows unrestricted choice of a prior, and, hence, we must allow for
the possibility that the fixed distribution of the observations does not
belong to the support of the prior. In this paper we show that in such a case
the posterior will concentrate near a point in the support of the prior that
is closest to the true sampling distribution as measured through the
Kullback-Leibler divergence, and we give a characterization for the rate of
concentration near this point.

Throughout the paper we assume that $X_1, X_2, \ldots$ are i.i.d.
observations, each distributed according to a probability measure $P_0$.
Given a model $\mathscr{P}$ and a prior $\Pi$, supported on $\mathscr{P}$,
the posterior mass of a measurable subset $B \subset \mathscr{P}$ is given by

$$\Pi_n(B \mid X_1,\ldots,X_n) = \int_B \prod_{i=1}^n p(X_i)\,d\Pi(P) \Big/ \int_{\mathscr{P}} \prod_{i=1}^n p(X_i)\,d\Pi(P). \tag{1.1}$$

Here it is assumed that the model is dominated by a $\sigma$-finite measure
$\mu$, and the density of a typical element $P \in \mathscr{P}$ relative to
the dominating measure is written $p$ and assumed appropriately measurable.
If we assume that the model is well specified, that is, $P_0 \in \mathscr{P}$,
then posterior consistency means that the posterior distributions concentrate
an arbitrarily large fraction of their total mass in arbitrarily small
neighborhoods of $P_0$, if the number of observations used to determine the
posterior is large enough. To formalize this, we let $d$ be a metric on
$\mathscr{P}$ and say that the Bayesian procedure for the specified prior is
consistent, if, for every $\varepsilon > 0$,
$\Pi_n(\{P : d(P,P_0) > \varepsilon\} \mid X_1,\ldots,X_n) \to 0$, in
$P_0$-probability. More specific information concerning the asymptotic
behavior of an estimator is given by its rate of convergence. Let
$\varepsilon_n > 0$ be a sequence that decreases to zero and suppose that,
for any constants $M_n \to \infty$,

$$\Pi_n(P \in \mathscr{P} : d(P,P_0) \ge M_n\varepsilon_n \mid X_1,\ldots,X_n) \to 0, \tag{1.2}$$

in $P_0$-probability. The sequence $\varepsilon_n$ corresponds to a decreasing
sequence of neighborhoods of $P_0$, the $d$-radius of which goes to zero with
$n$, while still capturing most of the posterior mass. If (1.2) is satisfied,
then we say that the rate of convergence is at least $\varepsilon_n$.

If $P_0$ is at a positive distance from the model $\mathscr{P}$ and the prior
concentrates all its mass on $\mathscr{P}$, then the posterior is inconsistent
as it will concentrate all its mass on $\mathscr{P}$ as well. However, in this
paper we show that the posterior will still settle down near a given measure
$P^* \in \mathscr{P}$, and we shall characterize the sequences $\varepsilon_n$
such that the preceding display is valid with $d(P,P^*)$ taking the place of
$d(P,P_0)$.

One would expect the posterior to concentrate its mass near minimum
Kullback-Leibler points, since asymptotically the likelihood
$\prod_{i=1}^n p(X_i)$ is maximal near points of minimal Kullback-Leibler
divergence. The integrand in the numerator of (1.1) is the likelihood, so
subsets of the model in which the (log-)likelihood is large account for a
large fraction of the total posterior mass. Hence, it is no great surprise
that the appropriate point of convergence $P^*$ is a minimum Kullback-Leibler
point in $\mathscr{P}$, but the general issue of rates (and which metric $d$
to use) turns out to be more complicated than expected. We follow the work by
Ghosal, Ghosh and van der Vaart [8] for the well-specified situation, but need
to adapt, change or extend many steps.
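
The heuristic in the preceding paragraph can be checked numerically in a toy
case. The sketch below is an illustration only and rests on assumptions that
are not taken from the paper: the model is the normal location family
$N(\theta,1)$ with $\theta$ on a grid in $[-3,3]$, the prior is uniform on
that grid, and the data are drawn from an Exponential(1) distribution, so the
model is misspecified. In this setting the Kullback-Leibler minimizer is
$\theta^* = 1$, the mean of the sampling distribution. The helper names
`grid_posterior` and `mass_near` are hypothetical.

```python
import math
import random

def grid_posterior(data, grid):
    """Posterior weights on `grid` for the N(theta, 1) model, uniform prior."""
    n = len(data)
    s1 = sum(data)
    s2 = sum(x * x for x in data)
    # Log-likelihood up to a constant: -0.5 * sum_i (x_i - theta)^2.
    loglik = [-0.5 * (s2 - 2.0 * t * s1 + n * t * t) for t in grid]
    top = max(loglik)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in loglik]
    total = sum(weights)
    return [w / total for w in weights]

def mass_near(grid, post, center, radius):
    """Posterior mass of {theta : |theta - center| <= radius}."""
    return sum(w for t, w in zip(grid, post) if abs(t - center) <= radius)

random.seed(0)
grid = [-3.0 + 0.005 * k for k in range(1201)]  # theta in [-3, 3]
theta_star = 1.0  # mean of Exponential(1): the Kullback-Leibler minimizer
masses = {}
for n in (20, 200, 2000):
    data = [random.expovariate(1.0) for _ in range(n)]
    masses[n] = mass_near(grid, grid_posterior(data, grid), theta_star, 0.2)
    print(n, round(masses[n], 4))
```

With growing $n$ the printed mass of the interval $[0.8, 1.2]$ should approach
one, even though no member of the model equals the sampling distribution.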

After deriving general results, we consider several examples in some detail, 
including Bayesian fitting of Gaussian mixtures using Dirichlet priors on 
the mixing distribution, the regression problem and parametric models. Our 
results on the regression problem allow one, for instance, to conclude that 
a Bayesian approach in the nonparametric problem that uses a prior on the 
regression function, but employs a normal distribution for the errors, will 
lead to consistent estimation of the regression function, even if the regression 
errors are non-Gaussian. This result, which is the Bayesian counterpart of 
the well-known fact that least squares estimators (the maximum likelihood 
estimators if the errors are Gaussian) perform well even if the errors are 
non-Gaussian, is important to validate the Bayesian approach to regression, 
but appears to have received little attention in the literature. 

1.1. Notation and organization. Let $L_1(\mathscr{X},\mathscr{A})$ denote the
set of all finite signed measures on $(\mathscr{X},\mathscr{A})$ and let
$\mathrm{conv}(\mathscr{Q})$ be the convex hull of a set of measures
$\mathscr{Q}$: the set of all finite linear combinations
$\sum_i \lambda_i Q_i$ for $Q_i \in \mathscr{Q}$ and $\lambda_i \ge 0$ with
$\sum_i \lambda_i = 1$. For a measurable function $f$, let $Qf$ denote the
integral $\int f\,dQ$.

The paper is organized as follows. Section 2 contains the main results of 
the paper, in increasing generality. Sections 3, 4 and 5 concern the three 
classes of examples that we consider: mixtures, the regression model and 
parametric models. Sections 6 and 7 contain the proofs of the main results, 
where the necessary results on tests are developed in Section 6 and are of 
independent interest. The final section is a technical appendix. 

2. Main results. Let $X_1, X_2, \ldots$ be an i.i.d. sample from a
distribution $P_0$ on a measurable space $(\mathscr{X},\mathscr{A})$. Given a
collection $\mathscr{P}$ of probability distributions on
$(\mathscr{X},\mathscr{A})$ and a prior probability measure $\Pi$ on
$\mathscr{P}$, the posterior measure is defined as in (1.1) (where $0/0 = 0$
by definition). Here it is assumed that the "model" $\mathscr{P}$ is dominated
by a $\sigma$-finite measure $\mu$ and that $x \mapsto p(x)$ is a density of
$P \in \mathscr{P}$ relative to $\mu$ such that the map $(x,p) \mapsto p(x)$
is measurable relative to the product of $\mathscr{A}$ and an appropriate
$\sigma$-field on $\mathscr{P}$, so that the right-hand side of (1.1) is a
measurable function of $(X_1,\ldots,X_n)$ and a probability measure as a
function of $B$ for every $X_1,\ldots,X_n$ such that the denominator is
positive. The "true" distribution $P_0$ may or may not belong to the model
$\mathscr{P}$. For simplicity of notation, we assume that $P_0$ possesses a
density $p_0$ relative to $\mu$ as well.

B. J. K. KLEIJN AND A. W. VAN DER VAART

Informally we think of the model $\mathscr{P}$ as the "support" of the prior
$\Pi$, but we shall not make this precise in a topological sense. At this
point we only assume that the prior concentrates on $\mathscr{P}$ in the sense
that $\Pi(\mathscr{P}) = 1$ (but we note later that this too can be relaxed).
Further requirements are made in the statements of the main results. Our main
theorems implicitly assume the existence of a point $P^* \in \mathscr{P}$
minimizing the Kullback-Leibler divergence of $P_0$ to the model
$\mathscr{P}$. In particular, the minimal Kullback-Leibler divergence is
assumed to be finite, that is, $P^*$ satisfies

$$-P_0 \log\frac{p^*}{p_0} < \infty. \tag{2.1}$$

By the convention that $\log 0 = -\infty$, the above implies that
$P_0 \ll P^*$ and, hence, we assume without loss of generality that the
density $p^*$ is strictly positive at the observations.

Our theorems give sufficient conditions for the posterior distribution to
concentrate in neighborhoods of $P^*$ at a rate that is determined by the
amount of prior mass "close to" the minimal Kullback-Leibler point $P^*$ and
the "entropy" of the model. To specify the terms between quotation marks,
we make the following definitions.

We define the entropy and the neighborhoods in which the posterior is to
concentrate its mass relative to a semi-metric $d$ on $\mathscr{P}$. The
general results are formulated relative to an arbitrary semi-metric and next
the conditions will be simplified for more specific choices. Whether or not
these simplifications can be made depends on the model $\mathscr{P}$,
convexity being an important special case (see Lemma 2.2). Unlike in the case
of well-specified priors, considered, for example, in [8], the Hellinger
distance is not always appropriate in the misspecified situation. The general
entropy bound is formulated in terms of a covering number for testing under
misspecification, defined for $\varepsilon > 0$ as follows: we define
$N_t(\varepsilon,\mathscr{P},d;P_0,P^*)$ as the minimal number $N$ of convex
sets $B_1,\ldots,B_N$ of probability measures on $(\mathscr{X},\mathscr{A})$
needed to cover the set
$\{P \in \mathscr{P} : \varepsilon < d(P,P^*) < 2\varepsilon\}$ such that, for
every $i$,

$$\inf_{P \in B_i} \sup_{0<\alpha<1} -\log P_0\Bigl(\frac{p}{p^*}\Bigr)^{\alpha} \ge \frac{\varepsilon^2}{4}. \tag{2.2}$$

If there is no finite covering of this type, we define the covering number to
be infinite. We refer to the logarithms
$\log N_t(\varepsilon,\mathscr{P},d;P_0,P^*)$ as entropy numbers for testing
under misspecification. These numbers differ from ordinary metric entropy
numbers in that the covering sets $B_i$ are required to satisfy the preceding
display rather than to be balls of radius $\varepsilon$. We insist that the
sets $B_i$ be convex and that (2.2) hold for every $P \in B_i$. This implies
that (2.2) may involve measures $P$ that do not belong to the model
$\mathscr{P}$ if this is not convex itself.

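For $\varepsilon > 0$, we define a specific kind of Kullback-Leibler
neighborhood of $P^*$ by

$$B(\varepsilon,P^*;P_0) = \Bigl\{P \in \mathscr{P} : -P_0\log\frac{p}{p^*} \le \varepsilon^2,\ P_0\Bigl(\log\frac{p}{p^*}\Bigr)^2 \le \varepsilon^2\Bigr\}. \tag{2.3}$$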
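
To make the neighborhood (2.3) concrete, the following sketch evaluates both
defining quantities in closed form for one simple case. The setting is an
assumption made for illustration and is not taken from the paper: the model
is the normal location family $N(\theta,1)$ and $P_0$ is any distribution with
mean $m$ and finite variance $s^2$, so that $P^* = N(m,1)$. With
$\delta = \theta - m$ one has $\log(p_\theta/p^*)(x) = \delta(x-m) - \delta^2/2$,
hence $-P_0\log(p_\theta/p^*) = \delta^2/2$ and
$P_0(\log(p_\theta/p^*))^2 = \delta^2 s^2 + \delta^4/4$. The function names
are hypothetical.

```python
def kl_excess(theta, m):
    """-P0 log(p_theta / p*) for the N(theta, 1) model with p* = N(m, 1)."""
    return 0.5 * (theta - m) ** 2

def second_moment(theta, m, s2):
    """P0 (log(p_theta / p*))^2 when P0 has mean m and variance s2."""
    delta = theta - m
    return delta ** 2 * s2 + 0.25 * delta ** 4

def in_neighborhood(theta, eps, m, s2):
    """Membership of N(theta, 1) in B(eps, P*; P0) as defined in (2.3)."""
    return (kl_excess(theta, m) <= eps ** 2
            and second_moment(theta, m, s2) <= eps ** 2)

# With m = 0, s2 = 4 and eps = 0.1 the second-moment constraint is the
# binding one: roughly |theta - m| <= eps / s.
print(in_neighborhood(0.04, 0.1, 0.0, 4.0))
print(in_neighborhood(0.10, 0.1, 0.0, 4.0))
```

In this example the neighborhood is an interval around $m$ whose half-width is
of the order $\varepsilon$, which is the kind of set whose prior mass enters
condition (2.4).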

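Theorem 2.1. For a given model $\mathscr{P}$, prior $\Pi$ on $\mathscr{P}$
and some $P^* \in \mathscr{P}$, assume that $-P_0\log(p^*/p_0) < \infty$ and
$P_0(p/p^*) < \infty$ for all $P \in \mathscr{P}$. Suppose that there exist a
sequence of strictly positive numbers $\varepsilon_n$ with
$\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$ and a constant
$L > 0$, such that, for all $n$,

$$\Pi(B(\varepsilon_n,P^*;P_0)) \ge e^{-Ln\varepsilon_n^2}, \tag{2.4}$$

$$N_t(\varepsilon,\mathscr{P},d;P_0,P^*) \le e^{n\varepsilon_n^2} \quad \text{for all } \varepsilon > \varepsilon_n. \tag{2.5}$$

Then for every sufficiently large constant $M$, as $n \to \infty$,

$$\Pi_n(P \in \mathscr{P} : d(P,P^*) \ge M\varepsilon_n \mid X_1,\ldots,X_n) \to 0 \quad \text{in } L_1(P_0^\infty). \tag{2.6}$$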

The proof of this theorem is given in Section 7. The theorem does not
explicitly require that $P^*$ be a point of minimal Kullback-Leibler
divergence, but this is implied by the conditions (see Lemma 6.4 below). The
theorem is extended to the case of nonunique minimal Kullback-Leibler points
in Section 2.4.

The two main conditions of Theorem 2.1 are a prior mass condition (2.4)
and an entropy condition (2.5), which can be compared to Schwartz' conditions
for posterior consistency (see [13]), or the two main conditions for
the well-specified situation in [8]. Below we discuss the background of these
conditions in turn.

The prior mass condition (2.4) reduces to the corresponding condition for
the correctly specified case in [8] if $P^* = P_0$. Because
$-P_0\log(p^*/p_0) < \infty$, we may rewrite the first inequality in the
definition (2.3) of the set $B(\varepsilon,P^*;P_0)$ as

$$-P_0\log\frac{p}{p_0} \le -P_0\log\frac{p^*}{p_0} + \varepsilon^2.$$

Therefore, the set $B(\varepsilon,P^*;P_0)$ contains only $P \in \mathscr{P}$
that are within $\varepsilon^2$ of the minimal Kullback-Leibler divergence
with respect to $P_0$ over the model. The lower bound (2.4) on the prior mass
of $B(\varepsilon,P^*;P_0)$ requires that the prior measure assign a certain
minimal share of its total mass to Kullback-Leibler neighborhoods of $P^*$.
As argued in [8], a rough understanding of the exact form of (2.4) for the
"optimal" rate $\varepsilon_n$ is that an optimal prior spreads its mass
"uniformly" over $\mathscr{P}$. In the proof of Theorem 2.1, the prior mass
condition serves to lower-bound the denominator in the expression for the
posterior.

The background of the entropy condition (2.5) is more involved, but can be
compared to a corresponding condition in the well-specified situation given
in Theorem 2.1 of [8]. The purpose of the entropy condition is to measure the
complexity of the model, a larger entropy leading to a slower rate of
convergence. The entropy used in [8] is either the ordinary metric entropy
$\log N(\varepsilon,\mathscr{P},d)$ or the local entropy
$\log N(\varepsilon/2, \{P \in \mathscr{P} : \varepsilon < d(P,P_0) < 2\varepsilon\}, d)$.
For $d$ the Hellinger distance, the minimal $\varepsilon_n$ satisfying
$\log N(\varepsilon_n,\mathscr{P},d) = n\varepsilon_n^2$ is roughly the
fastest rate of convergence for estimating a density in the model
$\mathscr{P}$ relative to $d$ obtainable by any method of estimation (cf.
[2]). We are not aware of a concept of "optimal rate of convergence" if the
model is misspecified, but a rough interpretation of (2.5) given (2.4) would
be that in the misspecified situation the posterior concentrates near the
closest Kullback-Leibler point at the optimal rate pertaining to the model
$\mathscr{P}$.

Misspecification requires that the complexity of the model be measured
in a different, somewhat complicated way. In examples, depending on the
semi-metric $d$, the covering numbers $N_t(\varepsilon,\mathscr{P},d;P_0,P^*)$
can be related to ordinary metric covering numbers
$N(\varepsilon,\mathscr{P},d)$. For instance, we show below (see Lemmas
2.1-2.3) that, if the model $\mathscr{P}$ is convex, then the numbers
$N_t(\varepsilon,\mathscr{P},d;P_0,P^*)$ are bounded by the covering numbers
$N(\varepsilon,\mathscr{P},d)$ if the distance $d(P_1,P_2)$ equals the
Hellinger distance between the measures $Q_i$ defined by
$dQ_i = (p_0/p^*)\,dP_i$, that is, a weighted Hellinger distance between
$P_1$ and $P_2$.

In the well-specified situation we have $P^* = P_0$, and the entropy numbers
for testing can be bounded above by ordinary entropy numbers for the
Hellinger distance. Thus, Theorem 2.1 becomes a refinement of the main
theorem of [8]. To see this, we first note that, since $-\log x \ge 1 - x$
for $x > 0$,

$$-\log P_0\Bigl(\frac{p}{p_0}\Bigr)^{1/2} \ge 1 - \int \sqrt{p}\sqrt{p_0}\,d\mu = \tfrac{1}{2}h^2(p,p_0).$$

It follows that the left-hand side of (2.2) with $\alpha = 1/2$ and
$p_0 = p^*$ is bounded below by $\inf_{P \in B_i} \tfrac{1}{2}h^2(p,p_0)$.
Because Hellinger balls are convex, they are eligible candidates for the sets
$B_i$ required in the definition of the covering numbers for testing. If we
cover the set $\{P \in \mathscr{P} : 2\varepsilon < h(P,P_0) < 4\varepsilon\}$
by a minimal set of Hellinger balls of radius $\varepsilon/2$, then these
balls automatically satisfy (2.2), by the triangle inequality. It follows that

$$N_t(\varepsilon,\mathscr{P},h;P_0,P_0) \le N(\varepsilon/2, \{P \in \mathscr{P} : 2\varepsilon < h(P,P_0) < 4\varepsilon\}, h).$$

The right-hand side is exactly a local covering number of the type used
by [8]. Because the entropy numbers for testing allow general convex sets
rather than balls of a given diameter, they appear to be genuinely smaller
in general than local covering numbers. (Notably, the convex sets need not
satisfy a size restriction.) However, we do not know of any examples of gains
in the setting we are interested in, so that in the well-specified case there
appears to be no use for the extended covering numbers as defined by (2.2).
In the general, misspecified situation they are essential, even for standard
parametric models, such as the one-dimensional normal location model.

At a more technical level, the entropy condition of [8] ensures the
existence of certain tests of the measures $P$ versus the true measure $P_0$.
In the misspecified case it is necessary to compare the measures $P$ to the
minimal Kullback-Leibler point $P^*$, rather than to $P_0$. It turns out that
the appropriate comparison is not a test of the measures $P$ versus the
measure $P^*$ in the ordinary sense of testing, but to test the measures
$Q(P)$ defined by $dQ(P) = (p_0/p^*)\,dP$ versus the measure $P_0$ [see
(7.4)]. With $\mathscr{Q}$ the set of measures $Q(P)$ where $P$ ranges over
$\mathscr{P}$, this leads to consideration of minimax testing risks of the
type

$$\inf_{\phi} \sup_{Q \in \mathscr{Q}} \bigl(P_0^n\phi + Q^n(1-\phi)\bigr),$$

where the infimum is taken over all measurable functions $\phi$ taking values
in $[0,1]$. A difference with the usual results on minimax testing risks is
that the measures $Q$ may not be probability measures (and may in fact be
infinite in general).

Extending arguments of Le Cam and Birgé, we show in Section 6 that, for
a convex set $\mathscr{Q}$, the minimax testing risk in the preceding display
is bounded above by

$$\inf_{0<\alpha<1} \sup_{Q \in \mathscr{Q}} \rho_\alpha(P_0,Q)^n, \tag{2.7}$$

where the function $\alpha \mapsto \rho_\alpha(P_0,Q)$ is the Hellinger
transform $\rho_\alpha(P,Q) = \int p^{\alpha}q^{1-\alpha}\,d\mu$. For
$Q = Q(P)$, the Hellinger transform reduces to the map

$$\alpha \mapsto \rho_{1-\alpha}(Q(P),P_0) = P_0\Bigl(\frac{p}{p^*}\Bigr)^{1-\alpha},$$

also encountered in (2.2). If the inequality in (2.2) is satisfied, then
$P_0(p/p^*)^{\alpha} \le e^{-\varepsilon^2/4}$ and, hence, the set of measures
$Q(P)$ with $P$ ranging over $B_i$ can be tested with error probabilities
bounded by $e^{-n\varepsilon^2/4}$. For $\varepsilon$ bounded away from zero,
or converging slowly to zero, these probabilities are exponentially small,
ensuring that the posterior does not concentrate on the "unlikely
alternatives" $B_i$.

The testing bound (2.7) is valid for convex alternatives $\mathscr{Q}$, but
the alternatives of interest
$\{P \in \mathscr{P} : d(P,P^*) > M\varepsilon\}$ are complements of balls
and, hence, typically not convex. A test function for nonconvex alternatives
can be constructed using a covering of $\mathscr{P}$ by convex sets. The
entropy condition (2.5) controls the size of this cover and, hence, the rate
of convergence in misspecified situations is determined by the covering
numbers $N_t(\varepsilon,\mathscr{P},d;P_0,P^*)$. Because the validity of the
theorem only relies on the existence of suitable tests, the entropy condition
(2.5) could be replaced by a testing condition. To be precise, condition (2.5)
can be replaced by the condition that the conclusion of Theorem 6.3 is
satisfied with $D(\varepsilon) = e^{n\varepsilon_n^2}$.

2.1. Distances and testing entropy. Because the entropies for testing are
somewhat abstract, it is useful to relate them to ordinary entropy numbers.
For our examples, the bound given by the following lemma is useful.
We assume that, for some fixed constants $c, C > 0$ and for every
$m \in \mathbb{N}$, $\lambda_1,\ldots,\lambda_m \ge 0$ with
$\sum_i \lambda_i = 1$ and every $P, P_1,\ldots,P_m \in \mathscr{P}$ with
$d(P,P_i) \le c\,d(P,P^*)$ for all $i$,

$$\sum_i \lambda_i d^2(P_i,P^*) - C\sum_i \lambda_i d^2(P_i,P) \le \sup_{0<\alpha<1} -\log P_0\Bigl(\frac{\sum_i \lambda_i p_i}{p^*}\Bigr)^{\alpha}. \tag{2.8}$$

Lemma 2.1. If (2.8) holds, then there exists a constant $A > 0$ depending
only on $c$ and $C$ such that, for all $\varepsilon > 0$,
$N_t(\varepsilon,\mathscr{P},d;P_0,P^*) \le N(A\varepsilon, \{P \in \mathscr{P} : \varepsilon < d(P,P^*) < 2\varepsilon\}, d)$.
[Any constant $A < (1/8) \wedge (1/(4\sqrt{C})) \wedge (\tfrac{1}{2}c)$ works.]

Proof. For a given constant $A > 0$, we can cover the set
$\mathscr{P}_\varepsilon := \{P \in \mathscr{P} : \varepsilon < d(P,P^*) < 2\varepsilon\}$
with $N = N(A\varepsilon,\mathscr{P}_\varepsilon,d)$ balls of radius
$A\varepsilon$. If the centers of these balls are not contained in
$\mathscr{P}_\varepsilon$, then we can replace these balls by $N$ balls of
radius $2A\varepsilon$ with centers in $\mathscr{P}_\varepsilon$ whose union
also covers the set $\mathscr{P}_\varepsilon$. It suffices to show that (2.2)
is valid for $B_i$ equal to the convex hull of a typical ball $B$ in this
cover. Choose $2A < c$. If $P \in \mathscr{P}_\varepsilon$ is the center of
$B$ and $P_i \in B$ for every $i$, then
$d(P_i,P^*) \ge d(P,P^*) - 2A\varepsilon$ by the triangle inequality and,
hence, by assumption (2.8), the left-hand side of (2.2) with
$B_i = \mathrm{conv}(B)$ is bounded below by
$\sum_i \lambda_i\bigl((\varepsilon - 2A\varepsilon)^2 - C(2A\varepsilon)^2\bigr)$.
This is bounded below by $\varepsilon^2/4$ for sufficiently small $A$. □

The logarithms
$\log N(A\varepsilon, \{P \in \mathscr{P} : \varepsilon < d(P,P^*) < 2\varepsilon\}, d)$
of the covering numbers in the preceding lemma are called "local entropy
numbers" and also the Le Cam dimension of the model $\mathscr{P}$ relative to
the semi-metric $d$. They are bounded above by the simpler ordinary entropy
numbers $\log N(A\varepsilon,\mathscr{P},d)$. The preceding lemma shows that
the entropy condition (2.5) can be replaced by the ordinary entropy condition
$\log N(\varepsilon_n,\mathscr{P},d) \le n\varepsilon_n^2$ whenever the
semi-metric $d$ satisfies (2.8).

If we evaluate (2.8) with $m = 1$ and $P_1 = P$, then we obtain, for every
$P \in \mathscr{P}$,

$$d^2(P,P^*) \le \sup_{0<\alpha<1} -\log P_0\Bigl(\frac{p}{p^*}\Bigr)^{\alpha}. \tag{2.9}$$

(Up to a factor 16, this inequality is also implied by finiteness of the
covering numbers for testing.) This simpler condition gives an indication
about the metrics $d$ that may be used in combination with ordinary entropy.
In Lemma 2.2 we show that if $d$ and the model $\mathscr{P}$ are convex, then
the simpler condition (2.9) is equivalent to (2.8).

Because $-\log x \ge 1 - x$ for every $x > 0$, we can further simplify by
bounding minus the logarithm in the right-hand side by
$1 - P_0(p/p^*)^{\alpha}$. This yields the bound

$$d^2(P,P^*) \le \sup_{0<\alpha<1} \Bigl[1 - P_0\Bigl(\frac{p}{p^*}\Bigr)^{\alpha}\Bigr].$$

In the well-specified situation we have $P_0 = P^*$ and the right-hand side
for $\alpha = 1/2$ becomes $1 - \int \sqrt{p}\sqrt{p_0}\,d\mu$, which is 1/2
times the squared Hellinger distance between $P$ and $P_0$. In misspecified
situations this method of lower bounding can be useless, as
$1 - P_0(p/p^*)^{\alpha}$ may be negative for $\alpha = 1/2$. On the other
hand, a small value of $\alpha$ may be appropriate, as it can be shown that as
$\alpha \downarrow 0$ the expression $1 - P_0(p/p^*)^{\alpha}$ is proportional
to the difference of Kullback-Leibler divergences $P_0\log(p^*/p)$, which is
positive by the definition of $P^*$. If this approximation can be made uniform
in $p$, then a semi-metric $d$ which is bounded above by the Kullback-Leibler
divergence can be used in the main theorem. We discuss this further in
Section 6 and use this in the examples of Sections 4 and 5.
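
The role of the exponent $\alpha$ can be seen in a closed-form example. The
sketch below is an illustration under assumptions that do not come from the
paper: the model is the normal location family $N(\theta,1)$ and
$P_0 = N(m,s^2)$, so that $P^* = N(m,1)$. Writing $\delta = \theta - m$, a
direct Gaussian computation gives
$-\log P_0(p_\theta/p^*)^{\alpha} = \tfrac{1}{2}\delta^2\alpha(1 - \alpha s^2)$.
For $s^2 > 2$ this is negative at $\alpha = 1/2$, while it is positive for
small $\alpha$ and behaves like $\alpha\,P_0\log(p^*/p_\theta) = \alpha\delta^2/2$
as $\alpha \downarrow 0$. The function names are hypothetical; the second
function checks the closed form by numerical integration.

```python
import math

def neg_log_term(delta, s2, alpha):
    """Closed form of -log P0 (p_theta / p*)^alpha for P0 = N(m, s2)."""
    return 0.5 * delta ** 2 * alpha * (1.0 - alpha * s2)

def neg_log_term_numeric(delta, s2, alpha, m=0.0, steps=20000):
    """The same quantity by the trapezoidal rule against the N(m, s2) density."""
    s = math.sqrt(s2)
    lo, hi = m - 12.0 * s, m + 12.0 * s
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        log_ratio = delta * (x - m) - 0.5 * delta ** 2  # log(p_theta / p*)(x)
        dens = math.exp(-0.5 * (x - m) ** 2 / s2) / math.sqrt(2.0 * math.pi * s2)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * dens * math.exp(alpha * log_ratio)
    return -math.log(total * h)

delta, s2 = 1.0, 4.0
for alpha in (0.5, 0.125, 0.01):
    print(alpha, neg_log_term(delta, s2, alpha),
          neg_log_term_numeric(delta, s2, alpha))
```

In this example the supremum over $\alpha$ is attained at
$\alpha = 1/(2s^2)$, which illustrates why the definition (2.2) takes a
supremum over $\alpha$ rather than fixing $\alpha = 1/2$.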

The case of convex models $\mathscr{P}$ is of interest, in particular, for
non- or semiparametric models, and permits some simplification. For a convex
model, the point of minimal Kullback-Leibler divergence (if it exists) is
automatically unique (up to redefinition on a null-set of $P_0$). Moreover,
the expectations $P_0(p/p^*)$ are automatically finite, as required in
Theorem 2.1, and condition (2.8) is satisfied for a weighted Hellinger metric.
We show this in Lemma 2.3, after first showing that validity of the simpler
lower bound (2.9) on the convex hull of $\mathscr{P}$ (if the semi-metric $d$
is defined on this convex hull) implies the bound (2.8).

Lemma 2.2. If $d$ is defined on the convex hull of $\mathscr{P}$, the maps
$P \mapsto d^2(P,P')$ are convex on $\mathrm{conv}(\mathscr{P})$ for every
$P' \in \mathscr{P}$ and (2.9) is valid for every $P$ in the convex hull of
$\mathscr{P}$, then (2.8) is satisfied for $\tfrac{1}{2}d$ instead of $d$.

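Lemma 2.3. If $\mathscr{P}$ is convex and $P^* \in \mathscr{P}$ is a point at
minimal Kullback-Leibler divergence with respect to $P_0$, then
$P_0(p/p^*) \le 1$ for every $P \in \mathscr{P}$ and (2.8) is satisfied with

$$d^2(P_1,P_2) = \frac{1}{4}\int \bigl(\sqrt{p_1} - \sqrt{p_2}\bigr)^2 \frac{p_0}{p^*}\,d\mu.$$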

Proof of Lemma 2.2. For the proof of Lemma 2.2, we first apply the
triangle inequality repeatedly to find

$$\sum_i \lambda_i d^2(P_i,P^*) \le 2\sum_i \lambda_i d^2(P_i,P) + 2d^2(P,P^*)$$

$$\le 2\sum_i \lambda_i d^2(P_i,P) + 4d^2\Bigl(P,\sum_i \lambda_i P_i\Bigr) + 4d^2\Bigl(\sum_i \lambda_i P_i,P^*\Bigr)$$

$$\le 6\sum_i \lambda_i d^2(P_i,P) + 4d^2\Bigl(\sum_i \lambda_i P_i,P^*\Bigr),$$

by the convexity of $d^2$. It follows that
$d^2(\sum_i \lambda_i P_i,P^*) \ge \tfrac{1}{4}\sum_i \lambda_i d^2(P_i,P^*) - \tfrac{3}{2}\sum_i \lambda_i d^2(P_i,P)$.
If (2.9) holds for $P = \sum_i \lambda_i P_i$, then we obtain (2.8) with $d^2$
replaced by $d^2/4$ and $C = 6$. □

Proof of Lemma 2.3. For $P \in \mathscr{P}$, define a family of convex
combinations $\{P_\lambda : \lambda \in [0,1]\} \subset \mathscr{P}$ by
$P_\lambda = \lambda P + (1-\lambda)P^*$. For all values of
$\lambda \in [0,1]$,

$$0 \le f(\lambda) := -P_0\log\frac{p_\lambda}{p^*} = -P_0\log\Bigl(1 + \lambda\Bigl(\frac{p}{p^*} - 1\Bigr)\Bigr), \tag{2.10}$$

since $P^* \in \mathscr{P}$ is at minimal Kullback-Leibler divergence with
respect to $P_0$ in $\mathscr{P}$ by assumption. For every fixed $y \ge 0$,
the function $\lambda \mapsto \log(1+\lambda y)/\lambda$ is nonnegative and
increases monotonically to $y$ as $\lambda \downarrow 0$. The function is
bounded in absolute value by 2 for $y \in [-1,0]$ and
$\lambda \le \tfrac{1}{2}$. Therefore, by the monotone and dominated
convergence theorems applied to the positive and negative parts of the
integrand in the right-hand side of (2.10),

$$f'(0+) = 1 - P_0\Bigl(\frac{p}{p^*}\Bigr).$$

Combining the fact that $f(0) = 0$ with (2.10), we see that $f'(0+) \ge 0$
and, hence, we find $P_0(p/p^*) \le 1$. The first assertion of Lemma 2.3 now
follows.

For the proof that (2.9) is satisfied, we first note that
$-\log x \ge 1 - x$, so that it suffices to show that
$1 - P_0(p/p^*)^{1/2} \ge d^2(P,P^*)$. Now

$$\int \bigl(\sqrt{p^*} - \sqrt{p}\bigr)^2 \frac{p_0}{p^*}\,d\mu = 1 + P_0\frac{p}{p^*} - 2P_0\sqrt{\frac{p}{p^*}} \le 2 - 2P_0\sqrt{\frac{p}{p^*}},$$

by the first part of the proof. □
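
The first assertion of Lemma 2.3 can be checked numerically on a small convex
model. The sketch below is an illustration under assumptions that are not
from the paper: the sample space has three points, the model is the segment
$\{\lambda p_1 + (1-\lambda)p_2 : \lambda \in [0,1]\}$ between two hypothetical
probability vectors, and the minimizer is located by a ternary search, which
is valid here because $\lambda \mapsto -P_0\log(p_\lambda/p_0)$ is convex. All
function names and numbers are hypothetical.

```python
import math

def mix(lam, p1, p2):
    """Density of lam * P1 + (1 - lam) * P2 on a finite sample space."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p1, p2)]

def kl(p0, p):
    """Kullback-Leibler divergence -P0 log(p / p0) on a finite sample space."""
    return sum(a * math.log(a / b) for a, b in zip(p0, p) if a > 0.0)

def kl_minimizer(p0, p1, p2, iters=200):
    """Ternary search for the lam in [0, 1] minimizing kl(p0, mix(lam))."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if kl(p0, mix(a, p1, p2)) <= kl(p0, mix(b, p1, p2)):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def ratio_expectation(p0, p, p_star):
    """P0(p / p*) on a finite sample space."""
    return sum(a * b / c for a, b, c in zip(p0, p, p_star))

p1, p2 = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
results = []
for p0 in ([0.3, 0.5, 0.2], [0.8, 0.15, 0.05]):
    lam_star = kl_minimizer(p0, p1, p2)
    p_star = mix(lam_star, p1, p2)
    results.append((lam_star,
                    ratio_expectation(p0, p1, p_star),
                    ratio_expectation(p0, p2, p_star)))
    print(results[-1])
```

For the first choice of $P_0$ the minimizer lies in the interior of the
segment and both expectations equal one up to numerical error; for the second
the minimizer is the endpoint $P_1$ and $P_0(p_2/p^*)$ is strictly below one.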

2.2. Extensions. In this section we give some generalizations of
Theorem 2.1. Theorem 2.2 enables us to prove that optimal rates are achieved
in parametric models. Theorem 2.3 extends Theorem 2.1 to situations in
which the model, the prior and the point $P^*$ are dependent on $n$. Third, we
consider the case in which the priors $\Pi_n$ assign a mass slightly less
than 1 to the models $\mathscr{P}_n$.

Theorem 2.1 does not give the optimal rate of convergence for
finite-dimensional models $\mathscr{P}$, both because the choice
$\varepsilon_n = 1/\sqrt{n}$ is excluded (by the condition
$n\varepsilon_n^2 \to \infty$) and because the prior mass condition is too
restrictive. The following theorem remedies this, but is more complicated.
The adapted prior mass condition takes the following form: for all natural
numbers $n$ and $j$,

$$\frac{\Pi(P \in \mathscr{P} : j\varepsilon_n < d(P,P^*) < 2j\varepsilon_n)}{\Pi(B(\varepsilon_n,P^*;P_0))} \le e^{n\varepsilon_n^2 j^2/8}. \tag{2.11}$$

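Theorem 2.2. For a given model $\mathscr{P}$, prior $\Pi$ on $\mathscr{P}$
and some $P^* \in \mathscr{P}$, assume that $-P_0\log(p^*/p_0) < \infty$ and
$P_0(p/p^*) < \infty$ for all $P \in \mathscr{P}$. If $\varepsilon_n$ are
strictly positive numbers with $\varepsilon_n \to 0$ and
$\liminf n\varepsilon_n^2 > 0$, such that (2.11) and (2.5) are satisfied,
then, for every sequence $M_n \to \infty$, as $n \to \infty$,

$$\Pi_n(P \in \mathscr{P} : d(P,P^*) \ge M_n\varepsilon_n \mid X_1,\ldots,X_n) \to 0 \quad \text{in } L_1(P_0). \tag{2.12}$$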

There appears to be no compelling reason to choose the model $\mathscr{P}$
and the prior $\Pi$ the same for every $n$. The validity of the preceding
theorems does not depend on this. We formalize this fact in the following
theorem. For each $n$, we let $\mathscr{P}_n$ be a set of probability measures
on $(\mathscr{X}_n,\mathscr{A}_n)$ given by densities $p_n$ relative to a
$\sigma$-finite measure $\mu_n$ on this space. Given a prior measure $\Pi_n$
on an appropriate $\sigma$-field, we define the posterior by (1.1) with $P^*$
and $\Pi$ replaced by $P_n^*$ and $\Pi_n$.

Theorem 2.3. The preceding theorems remain valid if $\mathscr{P}$, $\Pi$,
$P^*$ and $d$ depend on $n$, but satisfy the given conditions for each $n$
(for a single constant $L$).

As a final extension, we note that the assertion
$P_0\Pi_n(P \in \mathscr{P}_n : d_n(P,P_n^*) \ge M_n\varepsilon_n \mid X_1,\ldots,X_n) \to 0$
of the preceding theorems remains valid even if the priors $\Pi_n$ do not put
all their mass on the "models" $\mathscr{P}_n$ (but the models $\mathscr{P}_n$
do satisfy the entropy condition). Of course, in such cases the posterior
puts mass outside the model and it is desirable to complement the above
assertion with the assertion that
$\Pi_n(\mathscr{P}_n \mid X_1,\ldots,X_n) \to 1$ in $L_1(P_0)$. The latter is
certainly true if the priors put only very small fractions of their mass
outside the models $\mathscr{P}_n$. More precisely, the latter assertion is
true if condition (2.13) is satisfied.

This observation is not relevant for the examples in the present paper.
However, it may prove relevant to alleviate the entropy conditions in the
preceding theorems in certain situations. These conditions limit the
complexity of the models and it seems reasonable to allow a trade-off between
complexity and prior mass. Condition (2.13) allows a crude form of such a
trade-off: a small part of the model may be more complex, provided that it
receives a negligible amount of prior mass.

2.3. Consistency. The preceding theorems yield a rate of convergence
$\varepsilon_n \to 0$, expressed as a function of prior mass and model
entropy. In certain situations the prior mass and entropy may be hard to
quantify. In contrast, for inferring consistency of the posterior, such
quantification is unnecessary. This could be proved directly, as [13] achieved
in the well-specified situation, but it can also be inferred from the
preceding rate theorems. [A direct proof might actually give the same theorem
with a slightly bigger set $B(\varepsilon,P^*;P_0)$.] We consider this for the
situation of Theorem 2.1 only.

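Corollary 2.1. For a given model $\mathscr{P}$, prior $\Pi$ on $\mathscr{P}$
and some $P^* \in \mathscr{P}$, assume that $-P_0\log(p^*/p_0) < \infty$ and
$P_0(p/p^*) < \infty$ for all $P \in \mathscr{P}$. Suppose that, for every
$\varepsilon > 0$,

$$\Pi(B(\varepsilon,P^*;P_0)) > 0, \qquad \sup_{\eta>\varepsilon} N_t(\eta,\mathscr{P},d;P_0,P^*) < \infty. \tag{2.14}$$

Then for every $\varepsilon > 0$, as $n \to \infty$,

$$\Pi_n(P \in \mathscr{P} : d(P,P^*) \ge \varepsilon \mid X_1,\ldots,X_n) \to 0 \quad \text{in } L_1(P_0^\infty). \tag{2.15}$$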

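Proof. Define functions $f$ and $g$ by
$f(\varepsilon) = \Pi(B(\varepsilon,P^*;P_0))$ and
$g(\varepsilon) = \sup_{\eta>\varepsilon} N_t(\eta,\mathscr{P},d;P_0,P^*)$. We
shall show that there exists a sequence $\varepsilon_n \to 0$ such that
$f(\varepsilon_n) \ge e^{-Ln\varepsilon_n^2}$ and
$g(\varepsilon_n) \le e^{n\varepsilon_n^2}$ for all sufficiently large $n$.
This implies that the conditions of Theorem 2.1 are satisfied for this choice
of $\varepsilon_n$ and, hence, the posterior converges with rate at least
$\varepsilon_n$.

To show the existence of $\varepsilon_n$, define functions $h_n$ by

$$h_n(\varepsilon) = e^{-n\varepsilon^2}\Bigl(g(\varepsilon) + \frac{1}{f(\varepsilon)}\Bigr).$$

This is well defined and finite by the assumptions and
$h_n(\varepsilon) \to 0$ as $n \to \infty$, for every fixed
$\varepsilon > 0$. Therefore, there exists $\varepsilon_n \downarrow 0$ with
$h_n(\varepsilon_n) \to 0$ [e.g., fix $n_1 < n_2 < \cdots$ such that
$h_n(1/k) \le 1/k$ for all $n \ge n_k$; next define $\varepsilon_n = 1/k$
for $n_k \le n < n_{k+1}$]. In particular, there exists an $N$ such that
$h_n(\varepsilon_n) \le 1$ for $n \ge N$. This implies that
$f(\varepsilon_n) \ge e^{-n\varepsilon_n^2}$ and
$g(\varepsilon_n) \le e^{n\varepsilon_n^2}$ for every $n \ge N$. □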
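
The bracketed construction of $\varepsilon_n$ in the preceding proof can be
sketched in code. The functions `f` and `g` below are hypothetical stand-ins
for the prior mass $\Pi(B(\varepsilon,P^*;P_0))$ and the covering bound
$\sup_{\eta>\varepsilon} N_t(\eta,\mathscr{P},d;P_0,P^*)$; they are chosen
only so that the example runs and are not derived from any model in the
paper. The sketch uses that $n \mapsto h_n(\varepsilon)$ is decreasing, so the
first index at which $h_n(1/k) \le 1/k$ holds can serve as $n_k$.

```python
import math

def h(n, eps, f, g):
    """h_n(eps) = exp(-n eps^2) * (g(eps) + 1 / f(eps)), as in the proof."""
    return math.exp(-n * eps * eps) * (g(eps) + 1.0 / f(eps))

def thresholds(f, g, k_max):
    """Indices n_1 < n_2 < ... with h_n(1/k) <= 1/k for all n >= n_k."""
    ns, n = [], 0
    for k in range(1, k_max + 1):
        n += 1  # keep the sequence strictly increasing
        while h(n, 1.0 / k, f, g) > 1.0 / k:
            n += 1
        ns.append(n)
    return ns

def eps_n(n, ns):
    """eps_n = 1/k for n_k <= n < n_{k+1} (and 1 before n_1)."""
    k = sum(1 for nk in ns if nk <= n)
    return 1.0 / max(k, 1)

f = lambda eps: min(1.0, eps)      # hypothetical prior mass of B(eps, P*; P0)
g = lambda eps: 1.0 / (eps * eps)  # hypothetical bound on the covering numbers
ns = thresholds(f, g, 12)
print(ns)
print([eps_n(n, ns) for n in (1, 10, 100, 1000)])
```

The resulting sequence decreases to zero very slowly, which reflects that the
corollary yields consistency but no useful rate.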

2.4. Multiple points of minimal Kullback-Leibler divergence. In this
section we extend Theorem 2.1 to the situation where there exists a set
$\mathscr{P}^*$ of minimal Kullback-Leibler points.

Multiple minimal points can arise in two very different ways. First consider
the situation where the true distribution $P_0$ and the elements of the model
$\mathscr{P}$ possess different supports. Because the observations are
sampled from $P_0$, they fall with probability one in the set where
$p_0 > 0$ and, hence, the exact nature of the elements $p$ of the model
$\mathscr{P}$ on the set $\{p_0 = 0\}$ is irrelevant. Clearly, multiple minima
arise if the model contains multiple extensions of $P^*$ to the set on which
$p_0 = 0$. In this case the observations do not provide the means to
distinguish between these extensions and, consequently, no such distinction
can be made by the posterior either. Theorems 2.1 and 2.2 may apply under this
type of nonuniqueness, as long as we fix one of the minimal points, and the
assertion of the theorem will be true for any of the minimal points as soon as
it is true for one of the minimal points. This follows because, under the
conditions of the theorem, $d(P_1^*,P_2^*) = 0$ whenever $P_1^*$ and $P_2^*$
agree on the set $p_0 > 0$, in view of (2.9).

Genuine multiple points of minimal Kullback-Leibler divergence may occur as
well. For instance, one might fit a model consisting of normal distributions
with means in $(-\infty,-1] \cup [1,\infty)$ and variance one, in a situation
where the true distribution is normal with mean 0. The normal distributions
with means $-1$ and $1$ both have the minimal Kullback-Leibler divergence.
This situation is somewhat artificial and we are not aware of more interesting
examples in the nonparametric or semiparametric case that interests us most
in the present paper. Nevertheless, because it appears that the situation
might arise, we give a brief discussion of an extension of Theorem 2.1.
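
The two-point example above can be explored numerically. The sketch below is
an illustration under assumptions not taken from the paper: the variance of
the true normal distribution is taken to be one, the prior is uniform on a
grid over $[-4,-1] \cup [1,4]$ (the truncation at 4 is only for computation),
and the function names are hypothetical. The Kullback-Leibler divergence of
$N(0,1)$ from $N(\theta,1)$ is $\theta^2/2$, which is minimal over the model
at both $\theta = -1$ and $\theta = 1$.

```python
import math
import random

def kl_normal_means(theta, mu0=0.0):
    """Kullback-Leibler divergence of N(mu0, 1) from N(theta, 1)."""
    return 0.5 * (theta - mu0) ** 2

def posterior_on_split_model(data, step=0.001, upper=4.0):
    """Grid posterior for theta in [-upper, -1] U [1, upper], uniform prior."""
    n = len(data)
    xbar = sum(data) / n
    m = int(round((upper - 1.0) / step))
    right = [1.0 + k * step for k in range(m + 1)]
    grid = [-t for t in reversed(right)] + right
    # Log-likelihood up to a constant depends on the data through xbar only.
    loglik = [-0.5 * n * (t - xbar) ** 2 for t in grid]
    top = max(loglik)
    weights = [math.exp(v - top) for v in loglik]
    total = sum(weights)
    return grid, [w / total for w in weights]

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(500)]
grid, post = posterior_on_split_model(data)
near = sum(w for t, w in zip(grid, post) if abs(abs(t) - 1.0) <= 0.1)
left = sum(w for t, w in zip(grid, post) if t < 0.0)
print(kl_normal_means(-1.0), kl_normal_means(1.0))
print(round(near, 4), round(left, 4))
```

The posterior mass within distance 0.1 of the set $\{-1, 1\}$ is close to one,
while the split of that mass between the two points depends on the sign and
size of the sample mean. This is why the extension below is stated in terms of
the distance to a set of minimal points.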

Rather than to a single measure $P^* \in \mathscr{P}$, the extension refers
to a finite subset $\mathscr{P}^* \subset \mathscr{P}$ of points at minimal
Kullback-Leibler divergence. We give conditions under which the posterior
distribution concentrates asymptotically near this set of points. We redefine
our "covering numbers for testing under misspecification"
$N_t(\varepsilon,\mathscr{P},d;P_0,\mathscr{P}^*)$ as the minimal number $N$
of convex sets $B_1,\ldots,B_N$ of probability measures on
$(\mathscr{X},\mathscr{A})$ needed to cover

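the set $\{P \in \mathscr{P} : \varepsilon < d(P,\mathscr{P}^*) < 2\varepsilon\}$
such that

$$\inf_{P \in B_i} \sup_{P^* \in \mathscr{P}^*} \sup_{0<\alpha<1} -\log P_0\Bigl(\frac{p}{p^*}\Bigr)^{\alpha} \ge \frac{\varepsilon^2}{4}. \tag{2.16}$$

This roughly says that, for every $P \in \mathscr{P}$, there exists some
minimal point to which we can apply arguments as before.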

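Theorem 2.4. For a given model $\mathscr{P}$, prior $\Pi$ on $\mathscr{P}$
and some subset $\mathscr{P}^* \subset \mathscr{P}$, assume that
$-P_0\log(p^*/p_0) < \infty$ and $P_0(p/p^*) < \infty$ for all
$P \in \mathscr{P}$ and $P^* \in \mathscr{P}^*$. Suppose that there exist a
sequence of strictly positive numbers $\varepsilon_n$ with
$\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$ and a constant
$L > 0$, such that, for all $n$ and all $\varepsilon > \varepsilon_n$,

$$\inf_{P^* \in \mathscr{P}^*} \Pi(B(\varepsilon_n,P^*;P_0)) \ge e^{-Ln\varepsilon_n^2}, \tag{2.17}$$

$$N_t(\varepsilon,\mathscr{P},d;P_0,\mathscr{P}^*) \le e^{n\varepsilon_n^2}. \tag{2.18}$$

Then for every sufficiently large constant $M > 0$, as $n \to \infty$,

$$\Pi_n(P \in \mathscr{P} : d(P,\mathscr{P}^*) \ge M\varepsilon_n \mid X_1,\ldots,X_n) \to 0 \quad \text{in } L_1(P_0^\infty). \tag{2.19}$$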

3. Mixtures. Let $\mu$ denote the Lebesgue measure on $\mathbb{R}$. For each
$z \in \mathbb{R}$, let $x \mapsto p(x|z)$ be a fixed $\mu$-probability density on a measurable space
$(\mathcal{X}, \mathcal{A})$ that depends measurably on $(x, z)$, and for a distribution $F$ on $\mathbb{R}$
define the $\mu$-density

$$p_F(x) = \int p(x|z)\, dF(z).$$

Let $P_F$ be the corresponding probability measure. In this section we consider
mixture models $\mathcal{P} = \{P_F : F \in \mathcal{F}\}$, where $\mathcal{F}$ is the set of all probability
distributions on a given compact interval $[-M, M]$. We consider consistency
for general mixtures and derive a rate of convergence in the special case that
the family $p(\cdot|z)$ is the normal location family, that is, with $\phi$ the standard
normal density,

$$p_F(x) = \int \phi(x - z)\, dF(z). \tag{3.1}$$

The observations are an i.i.d. sample $X_1, \ldots, X_n$ drawn from a distribution
$P_0$ on $(\mathcal{X}, \mathcal{A})$ with $\mu$-density $p_0$ which is not necessarily of the mixture
form. As a prior for $F$, we use a Dirichlet process distribution (see [6, 7]) on
$\mathcal{F}$.

3.1. General mixtures. We say that the model is $P_0$-identifiable if, for
all pairs $F_1, F_2 \in \mathcal{F}$,

$$F_1 \ne F_2 \implies P_0(p_{F_1} \ne p_{F_2}) > 0.$$

Imposing this condition on the model excludes the first way in which
nonuniqueness of $P^*$ may occur (as discussed in Section 2.4).

Lemma 3.1. Assume that $-P_0 \log(p_F/p_0) < \infty$ for some $F \in \mathcal{F}$. If the
map $z \mapsto p(x|z)$ is continuous for all $x$, then there exists an $F^* \in \mathcal{F}$ that
minimizes $F \mapsto -P_0 \log(p_F/p_0)$ over $\mathcal{F}$. If the model is $P_0$-identifiable, then
this $F^*$ is unique.






Proof. If $F_n$ is a sequence in $\mathcal{F}$ with $F_n \to F$ for the weak topology
on $\mathcal{F}$, then $p_{F_n}(x) \to p_F(x)$ for every $x$, since the kernel is continuous in $z$
(and, hence, also bounded as a result of the compactness of $[-M, M]$) and by
use of the portmanteau lemma. Consequently, $p_{F_n} \to p_F$ in $L_1(\mu)$ by Scheffé's
lemma. It follows that the map $F \mapsto p_F$ from $\mathcal{F}$ into $L_1(\mu)$ is continuous
for the weak topology on $\mathcal{F}$. The set $\mathcal{F}$ is compact for this topology, by
Prohorov's theorem. The Kullback-Leibler divergence $p \mapsto -P_0 \log(p/p_0)$ is
lower semi-continuous as a map from $L_1(\mu)$ to $\mathbb{R}$ (see Lemma 8.2). Therefore,
the map $F \mapsto -P_0 \log(p_F/p_0)$ is lower semi-continuous on the compactum
$\mathcal{F}$ and, hence, attains its infimum on $\mathcal{F}$.

The map $p \mapsto -P_0 \log(p/p_0)$ is also convex. By the strict convexity of the
function $x \mapsto -\log x$, we have, for any $\lambda \in (0, 1)$,

$$-P_0 \log\frac{\lambda p_1 + (1 - \lambda) p_2}{p_0} < -\lambda P_0 \log\frac{p_1}{p_0} - (1 - \lambda) P_0 \log\frac{p_2}{p_0},$$

unless $P_0(p_1 = p_2) = 1$. This shows that the point of minimum of
$F \mapsto -P_0 \log(p_F/p_0)$ is unique if $\mathcal{F}$ is $P_0$-identifiable. □

Thus, a minimal Kullback-Leibler point $P_{F^*}$ exists and is unique under
mild conditions on the kernel $p$. Because the model is convex, Lemma 2.3
next shows that (2.9) is satisfied for the weighted Hellinger distance, whose
square is equal to

$$d^2(P_{F_1}, P_{F_2}) = \frac{1}{2} \int \bigl(\sqrt{p_{F_1}} - \sqrt{p_{F_2}}\bigr)^2 \frac{p_0}{p_{F^*}}\, d\mu. \tag{3.2}$$

If $p_0/p_{F^*} \in L_\infty(\mu)$, then this expression is bounded by a multiple of the squared
Hellinger distance $H$ between the measures $P_{F_1}$ and $P_{F_2}$.

Because $\mathcal{F}$ is compact for the weak topology and the map $F \mapsto p_F$ from
$\mathcal{F}$ to $L_1(\mu)$ is continuous (cf. the proof of Lemma 3.1), the model
$\mathcal{P} = \{P_F : F \in \mathcal{F}\}$ is compact relative to the total variation distance. Because the
Hellinger and total variation distances define the same uniform structure,
the model is also compact relative to the Hellinger distance and, hence, it
is totally bounded, that is, the covering numbers $N(\varepsilon, \mathcal{P}, H)$ are finite for
all $\varepsilon$. Combined with the result of the preceding paragraph and Lemmas 2.2
and 2.3, this yields the result that the entropy condition of Corollary 2.1
is satisfied for $d$ as in (3.2) if $p_0/p_{F^*} \in L_\infty(\mu)$ and we obtain the following
theorem.

Theorem 3.1. If $p_0/p_{F^*} \in L_\infty(\mu)$ and $\Pi(B(\varepsilon, P_{F^*}; P_0)) > 0$ for every
$\varepsilon > 0$, then $\Pi_n(F : d(P_F, P_{F^*}) \ge \varepsilon \mid X_1, \ldots, X_n) \to 0$ in $L_1(P_0^n)$ for $d$ given
by (3.2).

3.2. Gaussian mixtures. Next we specialize to the situation where
$p(x|z) = \phi(x - z)$ is a Gaussian convolution kernel and derive the rate of convergence.
The Gaussian convolution model is well known to be $P_0$-identifiable if $P_0$ is
Lebesgue absolutely continuous (see, e.g., [12]). Let $d$ be defined as in (3.2).
We assume that $P_0$ is such that $-P_0 \log(p_F/p_0)$ is finite for some $F$, so that
there exists a minimal Kullback-Leibler point $F^*$, by Lemma 3.1.

Lemma 3.2. If for some constant $C_1 > 0$, $d(p_{F_1}, p_{F_2}) \le C_1 H(p_{F_1}, p_{F_2})$,
then the entropy condition

$$\log N(\varepsilon_n, \mathcal{P}, d) \le n\varepsilon_n^2$$

is satisfied for $\varepsilon_n$ a large enough multiple of $\log n/\sqrt{n}$.

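Proof. Because the square of the Hellinger distance is bounded above
by the $L_1$-norm, the assumption implies that $d^2(p_{F_1}, p_{F_2}) \le C_1^2 \|p_{F_1} - p_{F_2}\|_1$.
Hence, for all $\varepsilon > 0$, we have $N(C_1\varepsilon, \mathcal{P}, d) \le N(\varepsilon^2, \mathcal{P}, \|\cdot\|_1)$. As a result of
Lemma 3.3 in [9], there exists a constant $C_2 > 0$ such that

$$\|p_{F_1} - p_{F_2}\|_1 \le C_2^2\, \|p_{F_1} - p_{F_2}\|_\infty \max\Bigl\{1, M, \sqrt{\log_+ \frac{1}{\|p_{F_1} - p_{F_2}\|_\infty}}\Bigr\}, \tag{3.3}$$

from which it follows that $N(C_2^2\, \varepsilon (\log(1/\varepsilon))^{1/2}, \mathcal{P}, \|\cdot\|_1) \le N(\varepsilon, \mathcal{P}, \|\cdot\|_\infty)$ for
small enough $\varepsilon$. With the help of Lemma 3.2 in [9], we see that there exists
a constant $C_3 > 0$ such that, for all $0 < \varepsilon < e^{-1}$,

$$\log N(\varepsilon, \mathcal{P}, \|\cdot\|_\infty) \le C_3 \Bigl(\log\frac{1}{\varepsilon}\Bigr)^2.$$

Combining all of the above, we note that, for small enough $\varepsilon > 0$,

$$\log N\Bigl(C_1 C_2\, \varepsilon^{1/2} \Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/4}, \mathcal{P}, d\Bigr) \le \log N(\varepsilon, \mathcal{P}, \|\cdot\|_\infty) \le C_3 \Bigl(\log\frac{1}{\varepsilon}\Bigr)^2.$$

So if we can find a sequence $\varepsilon_n$ such that, for all $n \ge 1$, there exists an $\varepsilon > 0$
such that

$$C_1 C_2\, \varepsilon^{1/2} \Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/4} \le \varepsilon_n \quad \text{and} \quad C_3 \Bigl(\log\frac{1}{\varepsilon}\Bigr)^2 \le n\varepsilon_n^2,$$

then we have demonstrated that

$$\log N(\varepsilon_n, \mathcal{P}, d) \le \log N\Bigl(C_1 C_2\, \varepsilon^{1/2} \Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/4}, \mathcal{P}, d\Bigr) \le C_3 \Bigl(\log\frac{1}{\varepsilon}\Bigr)^2 \le n\varepsilon_n^2.$$

One easily shows that this is the case for $\varepsilon_n = \max\{C_1 C_2, C_3\}(\log n/\sqrt{n})$
(in which case we choose, for fixed $n$, $\varepsilon = 1/n$), if $n$ is taken large enough.
□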
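
The last step of the proof can be illustrated numerically. The sketch below checks the two displayed requirements for the choice $\varepsilon = 1/n$ and $\varepsilon_n = \max\{C_1 C_2, C_3\} \log n/\sqrt{n}$. The numerical values of the constants are hypothetical stand-ins (the proof does not specify them), and we assume that all of them are at least one; for smaller constants a larger multiple of $\log n/\sqrt{n}$ may be needed.

```python
import math


def satisfies_lemma_3_2_conditions(n, c1=1.0, c2=2.0, c3=3.0):
    """Check both requirements of the proof with eps = 1/n.

    c1, c2, c3 are hypothetical values (all >= 1) standing in for the
    unspecified constants C_1, C_2, C_3.
    """
    eps = 1.0 / n
    eps_n = max(c1 * c2, c3) * math.log(n) / math.sqrt(n)
    log_inv = math.log(1.0 / eps)
    # First requirement: C_1 C_2 eps^{1/2} (log 1/eps)^{1/4} <= eps_n
    radius_ok = c1 * c2 * math.sqrt(eps) * log_inv ** 0.25 <= eps_n
    # Second requirement: C_3 (log 1/eps)^2 <= n eps_n^2
    entropy_ok = c3 * log_inv ** 2 <= n * eps_n ** 2
    return radius_ok, entropy_ok


for n in (3, 10, 1000, 10**6):
    print(n, satisfies_lemma_3_2_conditions(n))
```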



We are now in position to apply Theorem 2.1. We consider, for given
$M > 0$, the location mixtures (3.1) with the standard normal density $\phi$ as
the kernel. We choose the prior $\Pi$ equal to a Dirichlet prior on $\mathcal{F}$ specified
by a finite base measure $\alpha$ with compact support and positive, continuous
Lebesgue density on $[-M, M]$.

Theorem 3.2. Let $P_0$ be a distribution on $\mathbb{R}$ dominated by Lebesgue
measure $\mu$. Assume that $p_0/p_{F^*} \in L_\infty(\mu)$. Then the posterior distribution
concentrates its mass around $P_{F^*}$ asymptotically at the rate $\log n/\sqrt{n}$
relative to the distance $d$ on $\mathcal{P}$ given by (3.2).

Proof. The set of mixture densities $p_F$ with $F \in \mathcal{F}$ is bounded above
and below by the upper and lower envelope functions

$$U(x) = \phi(x + M) 1_{\{x < -M\}} + \phi(x - M) 1_{\{x > M\}} + \phi(0) 1_{\{-M \le x \le M\}},$$

$$L(x) = \phi(x - M) 1_{\{x < 0\}} + \phi(x + M) 1_{\{x \ge 0\}}.$$

So for any $F \in \mathcal{F}$,

$$P_0 \frac{p_F}{p_{F^*}} \le P_0 \frac{U}{L} \le \frac{\phi(0)}{\phi(2M)} + P_0\bigl(e^{-2MX} 1_{\{X < -M\}} + e^{2MX} 1_{\{X > M\}}\bigr) < \infty,$$

because $p_0$ is essentially bounded by a multiple of $p_{F^*}$ and $P_{F^*}$ has
sub-Gaussian tails. In view of Lemmas 2.2 and 2.3, the covering number for
testing $N_t(\varepsilon, \mathcal{P}, d; P_0, P^*)$ in (2.5) is bounded above by the ordinary metric
covering number $N(A\varepsilon, \mathcal{P}, d)$, for some constant $A$. Then Lemma 3.2
demonstrates that the entropy condition (2.5) is satisfied for $\varepsilon_n$ a large
multiple of $\log n/\sqrt{n}$.
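
The envelope bounds used at the start of the proof can be checked on a concrete mixture. The following minimal sketch takes a discrete mixing distribution on $[-M, M]$; the atoms, weights, grid and function names are hypothetical choices made only for illustration. It also confirms that $U/L$ equals $e^{2Mx}$ for $x > M$, which is the source of the exponential term in the display.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def upper_envelope(x, m):
    """U(x): the largest value of phi(x - z) over z in [-m, m]."""
    if x < -m:
        return phi(x + m)
    if x > m:
        return phi(x - m)
    return phi(0.0)


def lower_envelope(x, m):
    """L(x): the smallest value of phi(x - z) over z in [-m, m]."""
    return phi(x - m) if x < 0 else phi(x + m)


def mixture_density(x, atoms, weights):
    """Density of the discrete location mixture sum_j w_j * phi(x - z_j)."""
    return sum(w * phi(x - z) for z, w in zip(atoms, weights))


M = 2.0
atoms = [-2.0, -0.5, 0.3, 2.0]      # hypothetical support points in [-M, M]
weights = [0.1, 0.4, 0.3, 0.2]      # hypothetical mixing weights
grid = [k / 10.0 for k in range(-80, 81)]

sandwiched = all(
    lower_envelope(x, M) <= mixture_density(x, atoms, weights) <= upper_envelope(x, M)
    for x in grid
)
ratio_at_3 = upper_envelope(3.0, M) / lower_envelope(3.0, M)
print(sandwiched, ratio_at_3, math.exp(2.0 * M * 3.0))
```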

It suffices to verify the prior mass condition (2.4). Let $\varepsilon$ be given such
that $0 < \varepsilon < e^{-1}$. By Lemma 3.2 in [9], there exists a discrete distribution
function $F' \in D[-M, M]$ supported on at most $N \le C_2 \log(1/\varepsilon)$ points
$\{z_1, z_2, \ldots, z_N\} \subset [-M, M]$ such that $\|p_{F^*} - p_{F'}\|_\infty \le C_1 \varepsilon$, where $C_1, C_2 > 0$
are constants that depend on $M$ only. We write $F' = \sum_{j=1}^N p_j \delta_{z_j}$. Without
loss of generality, we may assume that the set $\{z_j : j = 1, \ldots, N\}$ is
$2\varepsilon$-separated. Namely, if this is not the case, we may choose a maximal
$2\varepsilon$-separated subset of $\{z_j : j = 1, \ldots, N\}$ and shift the weights $p_j$ to the
nearest point in the subset. A discrete $F''$ obtained in this fashion satisfies
$\|p_{F'} - p_{F''}\|_\infty \le 2\varepsilon \|\phi'\|_\infty$. So by virtue of the triangle inequality and the
fact that the derivative of the standard normal kernel $\phi$ is bounded, a given
$F'$ may be replaced by a $2\varepsilon$-separated $F''$ if the constant $C_1$ is changed
accordingly.
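
The separation step can be made concrete as follows. This is a sketch under stated assumptions: the greedy left-to-right scan is one possible way to obtain a maximal $2\varepsilon$-separated subset (the text does not prescribe a construction), and the atoms, weights and helper names are hypothetical. The check uses $\|\phi'\|_\infty = \phi(1)$ for the standard normal density.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mixture_density(x, atoms, weights):
    return sum(w * phi(x - z) for z, w in zip(atoms, weights))


def merge_to_separated(atoms, weights, eps):
    """Shift each weight to the nearest point of a maximal 2*eps-separated subset."""
    kept = []
    for z in sorted(atoms):
        # Greedy scan: keep z only if it is at least 2*eps away from all kept points.
        if all(abs(z - k) >= 2.0 * eps for k in kept):
            kept.append(z)
    merged = {k: 0.0 for k in kept}
    for z, w in zip(atoms, weights):
        nearest = min(kept, key=lambda k: abs(z - k))
        merged[nearest] += w
    return kept, [merged[k] for k in kept]


atoms = [-1.0, -0.95, 0.0, 0.02, 0.07, 1.5]     # hypothetical support of F'
weights = [0.2, 0.1, 0.25, 0.15, 0.1, 0.2]      # hypothetical weights p_j
eps = 0.05

kept, new_weights = merge_to_separated(atoms, weights, eps)
grid = [k / 100.0 for k in range(-600, 601)]
sup_diff = max(
    abs(mixture_density(x, atoms, weights) - mixture_density(x, kept, new_weights))
    for x in grid
)
bound = 2.0 * eps * phi(1.0)    # 2 * eps * sup|phi'|
print(kept, sup_diff, bound)
```

Every discarded atom lies within $2\varepsilon$ of a retained one, so each weight moves by less than $2\varepsilon$ and the sup-norm bound of the text follows from the mean value theorem.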

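By Lemma 3.3 in [9], there exists a constant $D_1$ such that the $L_1$-norm
of the difference satisfies

$$\|p_{F^*} - p_{F'}\|_1 \le D_1\, \varepsilon \Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/2},$$

for small enough $\varepsilon$. Using Lemma 3.6 in [9], we note, moreover, that there
exists a constant $D_2$ such that, for any $F \in \mathcal{F}$,

$$\|p_F - p_{F'}\|_1 \le D_2 \Bigl(\varepsilon + \sum_{j=1}^N \bigl|F[z_j - \varepsilon, z_j + \varepsilon] - p_j\bigr|\Bigr).$$

So there exists a constant $D > 0$ such that, if $F$ satisfies
$\sum_{j=1}^N |F[z_j - \varepsilon, z_j + \varepsilon] - p_j| \le \varepsilon$, then

$$\|p_F - p_{F^*}\|_1 \le D\, \varepsilon \Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/2}.$$

Let $Q(P)$ be the measure defined by $dQ(P) = (p_0/p_{F^*})\, dP$. The assumption
that $p_0/p_{F^*}$ is essentially bounded implies that there exists a constant $K > 0$
such that $\|Q(P_{F_1}) - Q(P_{F_2})\|_1 \le K \|p_{F_1} - p_{F_2}\|_1$ for all $F_1, F_2 \in \mathcal{F}$. Since
$Q(P_{F^*}) = P_0$, it follows that there exists a constant $D' > 0$ such that, for
small enough $\varepsilon > 0$,

$$\Bigl\{F \in \mathcal{F} : \sum_{j=1}^N \bigl|F[z_j - \varepsilon, z_j + \varepsilon] - p_j\bigr| \le \varepsilon\Bigr\} \subset \Bigl\{F \in \mathcal{F} : \|Q(P_F) - P_0\|_1 \le (D')^2\, \varepsilon \Bigl(\log\frac{1}{\varepsilon}\Bigr)^{1/2}\Bigr\}.$$

We have that $dQ(P_F)/dP_0 = p_F/p_{F^*}$ and $P_0(p_F/p_{F^*}) \le P_0(U/L) < \infty$. The
Hellinger distance is bounded by the square root of the $L_1$-distance. Therefore,
applying Lemma 8.1 with $\eta = \eta(\varepsilon) = D' \varepsilon^{1/2} (\log(1/\varepsilon))^{1/4}$, we see that
the set of measures $P_F$ with $F$ in the set on the right-hand side of the last
display is contained in the set

$$\Bigl\{P_F : F \in \mathcal{F},\ -P_0 \log\frac{p_F}{p_{F^*}} \le \zeta^2(\varepsilon),\ P_0\Bigl(\log\frac{p_F}{p_{F^*}}\Bigr)^2 \le \zeta^2(\varepsilon)\Bigr\} \subset B(\zeta(\varepsilon), P_{F^*}; P_0),$$

where $\zeta(\varepsilon) = D'' \eta(\varepsilon) \log(1/\eta(\varepsilon)) \le D'' D' \varepsilon^{1/2} (\log(1/\varepsilon))^{5/4}$, for an appropriate
constant $D''$, and small enough $\varepsilon$. It follows that

$$\Pi\bigl(B(\zeta(\varepsilon), P_{F^*}; P_0)\bigr) \ge \Pi\Bigl\{F \in \mathcal{F} : \sum_{j=1}^N \bigl|F[z_j - \varepsilon, z_j + \varepsilon] - p_j\bigr| \le \varepsilon\Bigr\}.$$

Following [8] (Lemma 6.1) or Lemma A.2 in [9], we see that the prior measure
at the right-hand side of the previous display is lower bounded by

$$c_1 \exp(-c_2 N \log(1/\varepsilon)) \ge \exp(-L (\log(1/\varepsilon))^2) \ge \exp(-L' (\log(1/\zeta(\varepsilon)))^2),$$

where $c_1 > 1$, $c_2 > 0$ are constants and $L = C_2 c_2 > 0$. So if we can find a
sequence $\varepsilon_n$ such that, for each sufficiently large $n$, there exists an $\varepsilon > 0$
such that

$$\varepsilon_n \ge \zeta(\varepsilon), \qquad n\varepsilon_n^2 \ge \Bigl(\log\frac{1}{\zeta(\varepsilon)}\Bigr)^2,$$

then $\Pi(B(\varepsilon_n, P_{F^*}; P_0)) \ge \Pi(B(\zeta(\varepsilon), P_{F^*}; P_0)) \ge \exp(-L' n\varepsilon_n^2)$ and, hence,
(2.4) is satisfied. One easily shows that, for $\varepsilon_n = \log n/\sqrt{n}$ and $\zeta(\varepsilon) = 1/\sqrt{n}$,
the two requirements are fulfilled for sufficiently large $n$. □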

4. Regression. Let $P_0$ be the distribution of a random vector $(X, Y)$
satisfying $Y = f_0(X) + e_0$ for independent random variables $X$ and $e_0$ taking
values in a measurable space $(\mathcal{X}, \mathcal{A})$ and in $\mathbb{R}$, respectively, and a
measurable function $f_0 : \mathcal{X} \to \mathbb{R}$. The variables $X$ and $e_0$ have given marginal
distributions, which may be unknown, but are fixed throughout the following.
The purpose is to estimate the regression function $f_0$ based on a random
sample of variables $(X_1, Y_1), \ldots, (X_n, Y_n)$ with the same distribution as
$(X, Y)$.

A Bayesian approach to this problem might start from the specification of
a prior distribution on a given class $\mathcal{F}$ of measurable functions $f : \mathcal{X} \to \mathbb{R}$.
If the distributions of $X$ and $e_0$ are known, this is sufficient to determine a
posterior. If these distributions are not known, then one might proceed to
introduce priors for these unknowns as well. The approach we take here is to
fix the distribution of $e_0$ to a normal or Laplace distribution, while aware of
the fact that its true distribution may be different. We investigate the
consequences of the resulting model misspecification. We shall show that
misspecification of the error distribution does not have serious consequences for
estimation of the regression function. In this sense a nonparametric Bayesian
approach possesses the same robustness to misspecification as minimum
contrast estimation using least squares or minimum absolute deviation. We shall
also see that the use of the Laplace distribution requires no conditions on
the tail of the distribution of the errors, whereas the normal distribution
appears to give good results only if these tails are not too big. Thus, the tail
robustness of minimum absolute deviation versus the nonrobustness of the
method of least squares also extends to Bayesian regression.

We build the posterior based on a regression model $Y = f(X) + e$ for $X$
and $e$ independent, as is the assumption on the true distribution of $(X, Y)$.
If we assume that the distribution $P_X$ of $X$ has a known form, then this
distribution cancels out of the expression for the posterior on $f$. If, instead,
we put independent priors on $f$ and $P_X$, respectively, then the prior on $P_X$
would disappear upon marginalization of the posterior of $(f, P_X)$ relative
to $f$. Thus, for investigating the posterior for $f$, we may assume without
loss of generality that the marginal distribution of $X$ is known. It can be
absorbed into the dominating measure $\mu$ for the model.

For $f \in \mathcal{F}$, let $P_f$ be the distribution of the random variable $(X, Y)$
satisfying $Y = f(X) + e$ for $X$ and $e$ independent variables, $X$ having the same
distribution as before and $e$ possessing a given density $p$, possibly different
from the density of the true error $e_0$. We shall consider the cases that $p$ is
normal and Laplace. Given a prior $\Pi$ on $\mathcal{F}$, the posterior distribution for $f$
is given by

$$B \mapsto \frac{\int_B \prod_{i=1}^n p(Y_i - f(X_i))\, d\Pi(f)}{\int \prod_{i=1}^n p(Y_i - f(X_i))\, d\Pi(f)}.$$

We shall show that this distribution concentrates near $f_0 + \mathrm{E}e_0$ in the case
that $p$ is a normal density and near $f_0 + \mathrm{median}(e_0)$ if $p$ is Laplace, if these
translates of the true regression function $f_0$ are contained in the model $\mathcal{F}$.
If the prior is misspecified also in the sense that $f_0 + \mu \notin \mathcal{F}$ (where $\mu$ is the
expectation or median of $e_0$), then, under some conditions, this remains true
with $f_0$ replaced by a "projection" $f^*$ of $f_0$ on $\mathcal{F}$. In agreement with the
notation in the rest of the paper, we shall denote the true distribution of an
observation $(X, Y)$ by $P_0$ (stressing that, in general, $P_0$ is different from $P_f$
with $f = 0$). The model $\mathcal{P}$ as in the statement of the main results is the set
of all distributions $P_f$ on $\mathcal{X} \times \mathbb{R}$ with $f \in \mathcal{F}$.

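4.1. Normal regression. Suppose that the density $p$ is equal to the standard
normal density $p(z) = (2\pi)^{-1/2} \exp(-\tfrac{1}{2}z^2)$. Then, with $\mu = \mathrm{E}e_0$,

$$\log\frac{p_f}{p_{f_0}}(X, Y) = -\frac{1}{2}(f - f_0)^2(X) + e_0 (f - f_0)(X),$$

$$-P_0 \log\frac{p_f}{p_{f_0}} = \frac{1}{2} P_0 (f - f_0 - \mu)^2 - \frac{1}{2}\mu^2. \tag{4.1}$$

It follows that the Kullback-Leibler divergence $f \mapsto -P_0 \log(p_f/p_0)$ is
minimized for $f = f^* \in \mathcal{F}$ minimizing the map $f \mapsto P_0(f - f_0 - \mu)^2$.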
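
Identity (4.1) can be verified exactly on a small discrete example. The sketch below is illustrative only: the design points, the error law (chosen with nonzero mean), the two regression functions and the function name are all hypothetical.

```python
def kl_gap_normal_regression(xs, errors, f, f0):
    """Both sides of (4.1) for X uniform on xs and a discrete error law.

    errors is a list of (value, probability) pairs for e0.
    Returns (-P0 log(p_f / p_{f0}), 0.5 * P0 (f - f0 - mu)^2 - 0.5 * mu^2).
    """
    mu = sum(e * p for e, p in errors)
    px = 1.0 / len(xs)
    lhs = 0.0
    rhs = 0.0
    for x in xs:
        diff = f(x) - f0(x)
        rhs += px * 0.5 * (diff - mu) ** 2
        for e, pe in errors:
            y = f0(x) + e
            # -log(p_f / p_{f0}) at (x, y) for the standard normal working density
            lhs += px * pe * 0.5 * ((y - f(x)) ** 2 - (y - f0(x)) ** 2)
    return lhs, rhs - 0.5 * mu ** 2


xs = [0.0, 1.0, 2.0]
errors = [(-1.0, 0.5), (0.5, 0.3), (3.0, 0.2)]    # hypothetical law of e0, mean 0.25
f0 = lambda x: x
f = lambda x: 0.5 * x + 1.0

lhs, rhs = kl_gap_normal_regression(xs, errors, f, f0)
shifted_lhs, _ = kl_gap_normal_regression(xs, errors, lambda x: x + 0.25, f0)
print(lhs, rhs, shifted_lhs)
```

In this example the shifted function $f_0 + \mu$ attains the value $-\mu^2/2$, the smallest value the right-hand side of (4.1) can take.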

In particular, if $f_0 + \mu \in \mathcal{F}$, then the minimizer is $f_0 + \mu$ and $P_{f_0 + \mu}$ is the
point in the model that is closest to $P_0$ in the Kullback-Leibler sense. If also
$\mu = 0$, then, even though the posterior on $P_f$ will concentrate asymptotically
near $P_{f_0}$, which is typically not equal to $P_0$, the induced posterior on $f$ will
concentrate near the true regression function $f_0$. This favorable property of
Bayesian estimation is analogous to that of least squares estimators, also for
nonnormal error distributions.

If $f_0 + \mu$ is not contained in the model, then the posterior for $f$ will in
general not be consistent. We assume that there exists a unique $f^* \in \mathcal{F}$
that minimizes $f \mapsto P_0(f - f_0 - \mu)^2$, as is the case, for instance, if $\mathcal{F}$ is a
closed, convex subset of $L_2(P_0)$. Under some conditions we shall show that
the posterior concentrates asymptotically near $f^*$. If $\mu = 0$, then $f^*$ is the
projection of $f_0$ into $\mathcal{F}$ and the posterior still behaves in a desirable manner.
For simplicity of notation, we assume that $\mathrm{E}_0 e_0 = 0$.

The following lemma shows that (2.8) is satisfied for a multiple of the
$L_2(P_0)$-distance on $\mathcal{F}$.

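Lemma 4.1. Let $\mathcal{F}$ be a class of uniformly bounded functions $f : \mathcal{X} \to \mathbb{R}$
such that either $f_0 \in \mathcal{F}$ or $\mathcal{F}$ is convex and closed in $L_2(P_0)$. Assume that $f_0$
is uniformly bounded, that $\mathrm{E}_0 e_0 = 0$ and that $\mathrm{E}_0 e^{M|e_0|} < \infty$ for every $M > 0$.
Then there exist positive constants $C_1, C_2, C_3$ such that, for all $m \in \mathbb{N}$,
$f, f_1, \ldots, f_m \in \mathcal{F}$ and $\lambda_1, \ldots, \lambda_m \ge 0$ with $\sum_i \lambda_i = 1$,

$$P_0 \log\frac{p_f}{p_{f^*}} \le -\frac{1}{2} P_0 (f - f^*)^2, \tag{4.2}$$

$$P_0 \Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le C_1 P_0 (f - f^*)^2,$$

$$\sup_{0 < \alpha < 1} -\log P_0 \Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^{\alpha} \ge C_2 \sum_i \lambda_i \bigl(P_0 (f_i - f^*)^2 - C_3 P_0 (f - f_i)^2\bigr).$$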

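Proof. We have

$$\log\frac{p_f}{p_{f^*}}(X, Y) = -\frac{1}{2}\bigl[(f_0 - f)^2 - (f_0 - f^*)^2\bigr](X) - e_0 (f^* - f)(X). \tag{4.3}$$

The second term on the right-hand side has mean zero by assumption. The
first term on the right-hand side has expectation $-\tfrac{1}{2} P_0 (f^* - f)^2$ if $f_0 = f^*$, as
is the case if $f_0 \in \mathcal{F}$. Furthermore, if $\mathcal{F}$ is convex, the minimizing property
of $f^*$ implies that $P_0 (f_0 - f^*)(f^* - f) \ge 0$ for every $f \in \mathcal{F}$ and then the
expectation of the first term on the right-hand side is bounded above by
$-\tfrac{1}{2} P_0 (f^* - f)^2$. Therefore, in both cases (4.2) holds.

From (4.3) we also have, with $M$ a uniform upper bound on $\mathcal{F}$ and $f_0$,

$$P_0 \Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le P_0\bigl[(f^* - f)^2 (2f_0 - f - f^*)^2\bigr] + 2 P_0 e_0^2\, P_0 (f^* - f)^2,$$

$$P_0 \Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \Bigl(\frac{p_f}{p_{f^*}}\Bigr)^{\alpha} \le P_0\Bigl[\bigl((f^* - f)^2 (2f_0 - f - f^*)^2 + 2 e_0^2 (f^* - f)^2\bigr)\, e^{2\alpha(2M^2 + M|e_0|)}\Bigr].$$

Both right-hand sides can be further bounded by a constant times
$P_0 (f - f^*)^2$, where the constant depends on $M$ and the distribution of $e_0$ only.

In view of Lemma 4.3 (below) with $p = p_{f^*}$ and $q_i = p_{f_i}$, we see that there
exists a constant $C > 0$ depending on $M$ only such that, for all $\lambda_i \ge 0$ with
$\sum_i \lambda_i = 1$,

$$\Bigl|1 - P_0 \Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^{\alpha} - \alpha P_0 \log\frac{p_{f^*}}{\sum_i \lambda_i p_{f_i}}\Bigr| \le 2\alpha^2 C \sum_i \lambda_i P_0 (f_i - f^*)^2. \tag{4.4}$$

By Lemma 4.3 with $\alpha = 1$ and $p = p_f$ and similar arguments, we also have
that, for any $f \in \mathcal{F}$,

$$\Bigl|1 - P_0 \Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_f}\Bigr) - P_0 \log\frac{p_f}{\sum_i \lambda_i p_{f_i}}\Bigr| \le 2C \sum_i \lambda_i P_0 (f_i - f)^2.$$

For $\lambda_i = 1$ this becomes

$$\Bigl|1 - P_0 \Bigl(\frac{p_{f_i}}{p_f}\Bigr) - P_0 \log\frac{p_f}{p_{f_i}}\Bigr| \le 2C P_0 (f_i - f)^2.$$

Taking differences, we obtain that

$$\Bigl|P_0 \log\frac{p_f}{\sum_i \lambda_i p_{f_i}} - \sum_i \lambda_i P_0 \log\frac{p_f}{p_{f_i}}\Bigr| \le 4C \sum_i \lambda_i P_0 (f_i - f)^2.$$

By the fact that $\log(ab) = \log a + \log b$ for every $a, b > 0$, this inequality
remains true if $f$ on the left is replaced by $f^*$. Combine the resulting inequality
with (4.4) to find that

$$1 - P_0 \Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^{\alpha} \ge \alpha \sum_i \lambda_i P_0 \log\frac{p_{f^*}}{p_{f_i}} - 2\alpha^2 C \sum_i \lambda_i P_0 (f^* - f_i)^2 - 4\alpha C \sum_i \lambda_i P_0 (f_i - f)^2$$

$$\ge \alpha\Bigl(\frac{1}{2} - 2\alpha C\Bigr) \sum_i \lambda_i P_0 (f^* - f_i)^2 - 4C \sum_i \lambda_i P_0 (f_i - f)^2,$$

where we have used (4.2). For sufficiently small $\alpha > 0$ and suitable constants
$C_2, C_3$, the right-hand side is bounded below by the right-hand side
of the lemma. Finally the left-hand side of the lemma can be bounded by
the supremum over $\alpha \in (0, 1)$ of the left-hand side of the last display, since
$-\log x \ge 1 - x$ for every $x > 0$. □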

In view of the preceding lemma, the estimation of the quantities involved
in the main theorems can be based on the $L_2(P_0)$ distance.

The "neighborhoods" $B(\varepsilon, P_{f^*}; P_0)$ involved in the prior mass conditions
(2.4) and (2.11) can be interpreted in the form

$$B(\varepsilon, P_{f^*}; P_0) = \bigl\{f \in \mathcal{F} : P_0 (f - f_0)^2 \le P_0 (f^* - f_0)^2 + \varepsilon^2,\ P_0 (f - f^*)^2 \le \varepsilon^2\bigr\}.$$

If $P_0 (f - f^*)(f^* - f_0) = 0$ for every $f \in \mathcal{F}$ (as is the case if $f^* = f_0$ or if $f^*$
lies in the interior of $\mathcal{F}$), then this reduces to an $L_2(P_0)$-ball around $f^*$ by
Pythagoras' theorem.

In view of the preceding lemma and Lemma 2.1, the entropy for testing in
(2.5) can be replaced by the local entropy of $\mathcal{F}$ for the $L_2(P_0)$-metric. The
rate of convergence of the posterior distribution guaranteed by Theorem 2.1
is then also relative to the $L_2(P_0)$-distance. These observations yield the
following theorem.

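Theorem 4.1. Assume the assertions of Lemma 4.1 and, in addition,
that $P_0 (f - f^*)(f^* - f_0) = 0$ for every $f \in \mathcal{F}$. If $\varepsilon_n$ is a sequence of strictly
positive numbers with $\varepsilon_n \to 0$ and $n\varepsilon_n^2 \to \infty$ such that, for a constant $L > 0$
and all $n$,

$$\Pi\bigl(f \in \mathcal{F} : P_0 (f - f^*)^2 \le \varepsilon_n^2\bigr) \ge e^{-Ln\varepsilon_n^2}, \tag{4.5}$$

$$N(\varepsilon_n, \mathcal{F}, \|\cdot\|_{P_0, 2}) \le e^{n\varepsilon_n^2}, \tag{4.6}$$

then $\Pi_n(f \in \mathcal{F} : P_0 (f - f^*)^2 \ge M\varepsilon_n^2 \mid X_1, \ldots, X_n) \to 0$ in $L_1(P_0^n)$, for every
sufficiently large constant $M$.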

There are many special cases of interest of this theorem and the more 
general results that can be obtained from Theorems 2.1 and 2.2 using the 
preceding reasoning. Some of these are considered in the context of the well- 
specified regression model [14]. The necessary estimates on the prior mass 
and the entropy are not different for problems other than the regression 
model. Entropy estimates can also be found in work on rates of convergence 
of minimum contrast estimators. For these reasons we exclude a discussion 
of concrete examples. 

The following pair of lemmas was used in the proof of the preceding 
results. 



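Lemma 4.2. There exists a universal constant $C$ such that, for any
probability measure $P_0$ and any finite measures $P$ and $Q$ and any $0 < \alpha \le 1$,

$$\Bigl|1 - P_0 \Bigl(\frac{q}{p}\Bigr)^{\alpha} - \alpha P_0 \log\frac{p}{q}\Bigr| \le \alpha^2 C P_0 \Bigl[\Bigl(\log\frac{q}{p}\Bigr)^2 \Bigl(\Bigl(\frac{q}{p}\Bigr)^{\alpha} 1_{\{q > p\}} + 1_{\{q \le p\}}\Bigr)\Bigr].$$

Proof. The function $R$ defined by $R(x) = (e^x - 1 - x)/(x^2 e^x)$ for $x \ge 0$
and $R(x) = (e^x - 1 - x)/x^2$ for $x \le 0$ is uniformly bounded on $\mathbb{R}$ by a constant
$C$. We can write

$$P_0 \Bigl(\frac{q}{p}\Bigr)^{\alpha} = 1 + \alpha P_0 \log\frac{q}{p} + P_0 \Bigl[R\Bigl(\alpha \log\frac{q}{p}\Bigr) \Bigl(\alpha \log\frac{q}{p}\Bigr)^2 \Bigl(\Bigl(\frac{q}{p}\Bigr)^{\alpha} 1_{\{q > p\}} + 1_{\{q \le p\}}\Bigr)\Bigr].$$

The lemma follows. □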
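
The boundedness of the remainder function $R$ used in the proof of Lemma 4.2 can be inspected numerically. The sketch below evaluates $R$ on a grid that avoids $x = 0$ (where $R$ is defined by continuity); the grid range and the function name are arbitrary illustrative choices. On this grid the supremum stays below $1/2$, which is consistent with the second-order Taylor remainder of the exponential function.

```python
import math


def remainder_ratio(x):
    """R(x) from the proof of Lemma 4.2, for x != 0."""
    # expm1 keeps e^x - 1 - x accurate for small |x|.
    base = (math.expm1(x) - x) / (x * x)
    return base / math.exp(x) if x > 0 else base


grid = [k / 100.0 for k in range(-5000, 5001) if k != 0]
sup_r = max(remainder_ratio(x) for x in grid)
inf_r = min(remainder_ratio(x) for x in grid)
print(sup_r, inf_r)
```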



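Lemma 4.3. There exists a universal constant $C$ such that, for any
probability measure $P_0$ and all finite measures $P, Q_1, \ldots, Q_m$ and constants
$0 < \alpha \le 1$, $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$,

$$\Bigl|1 - P_0 \Bigl(\frac{\sum_i \lambda_i q_i}{p}\Bigr)^{\alpha} - \alpha P_0 \log\frac{p}{\sum_i \lambda_i q_i}\Bigr| \le 2\alpha^2 C \sum_i \lambda_i P_0 \Bigl[\Bigl(\log\frac{q_i}{p}\Bigr)^2 \Bigl(\Bigl(\frac{q_i}{p}\Bigr)^2 + 1\Bigr)\Bigr].$$

Proof. In view of Lemma 4.2 with $q = \sum_i \lambda_i q_i$, it suffices to bound

$$P_0 \Bigl[\Bigl(\log\frac{\sum_i \lambda_i q_i}{p}\Bigr)^2 \Bigl(\Bigl(\frac{\sum_i \lambda_i q_i}{p}\Bigr)^{\alpha} 1_{\{\sum_i \lambda_i q_i > p\}} + 1_{\{\sum_i \lambda_i q_i \le p\}}\Bigr)\Bigr]$$

by the right-hand side of the lemma. We can replace $\alpha$ in the display by 2
and make the expression larger. Next we bound the two terms corresponding
to the decomposition by indicators separately.

By the convexity of the map $x \mapsto x \log x$,

$$\Bigl(\log\frac{\sum_i \lambda_i q_i}{p}\Bigr) \frac{\sum_i \lambda_i q_i}{p} \le \sum_i \lambda_i \Bigl(\log\frac{q_i}{p}\Bigr) \frac{q_i}{p}.$$

If $\sum_i \lambda_i q_i > p$, then the left-hand side is positive and the inequality is
preserved when we square on both sides. Convexity of the map $x \mapsto x^2$ allows
us to bound the square of the right-hand side as in the lemma.

By the concavity of the logarithm,

$$-\log\frac{\sum_i \lambda_i q_i}{p} \le -\sum_i \lambda_i \log\frac{q_i}{p}.$$

On the set $\sum_i \lambda_i q_i < p$ the left-hand side is positive and we can again take
squares on both sides and preserve the inequality. □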

4.2. Laplace regression. Suppose that the error-density $p$ is equal to the
Laplace density $p(x) = \tfrac{1}{2}\exp(-|x|)$. Then

$$\log\frac{p_f}{p_{f_0}}(X, Y) = -\bigl(|e_0 + f_0(X) - f(X)| - |e_0|\bigr),$$

$$-P_0 \log\frac{p_f}{p_{f_0}} = P_0 \Phi(f - f_0),$$

for $\Phi(\nu) = \mathrm{E}_0(|e_0 - \nu| - |e_0|)$. The function $\Phi$ is minimized over $\nu \in \mathbb{R}$ at the
median of $e_0$. It follows that if $f_0 + m$, for $m$ the median of $e_0$, is contained
in $\mathcal{F}$, then the Kullback-Leibler divergence $-P_0 \log(p_f/p_0)$ is minimized
over $f \in \mathcal{F}$ at $f = f_0 + m$. If $\mathcal{F}$ is a compact, convex subset of $L_1(P_0)$,
then in any case there exists $f^* \in \mathcal{F}$ that minimizes the Kullback-Leibler
divergence, but it appears difficult to determine this concretely in general.
For simplicity of notation, we shall assume that $m = 0$.
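
The roles of the median (Laplace working density) and the mean (normal working density) can be seen side by side in a small computation. The sketch below uses a hypothetical skewed discrete error law with median 1.0 and mean 1.6, and a simple grid search; all names and values are illustrative assumptions.

```python
def laplace_phi(nu, errors):
    """Phi(nu) = E(|e0 - nu| - |e0|) for a discrete law [(value, prob), ...]."""
    return sum(p * (abs(e - nu) - abs(e)) for e, p in errors)


def squared_loss(nu, errors):
    """E(e0 - nu)^2, the criterion that arises with the normal working density."""
    return sum(p * (e - nu) ** 2 for e, p in errors)


# Hypothetical skewed error law: median 1.0, mean 1.6.
errors = [(-2.0, 0.2), (-0.5, 0.2), (1.0, 0.3), (6.0, 0.3)]
grid = [k / 100.0 for k in range(-300, 701)]

laplace_argmin = min(grid, key=lambda nu: laplace_phi(nu, errors))
normal_argmin = min(grid, key=lambda nu: squared_loss(nu, errors))
print(laplace_argmin, normal_argmin)
```

For this error law the two working densities therefore center the posterior at different translates of $f_0$, in line with the discussion of $f_0 + \mathrm{median}(e_0)$ and $f_0 + \mathrm{E}e_0$.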

If the distribution of $e_0$ is smooth, then the function $\Phi$ will be smooth
too. Because it is minimal at $\nu = m = 0$, it is reasonable to expect that, for
$\nu$ in a neighborhood of $m = 0$ and some positive constant $C_0$,

$$\Phi(\nu) = \mathrm{E}_0(|e_0 - \nu| - |e_0|) \ge C_0 |\nu|^2. \tag{4.7}$$

Because $\Phi$ is convex, it is also reasonable to expect that its second derivative,
if it exists, is strictly positive.

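Lemma 4.4. Let $\mathcal{F}$ be a class of uniformly bounded functions $f : \mathcal{X} \to \mathbb{R}$
and let $f_0$ be uniformly bounded. Assume that either $f_0 \in \mathcal{F}$ and (4.7) holds,
or that $\mathcal{F}$ is convex and compact in $L_1(P_0)$ and that $\Phi$ is twice continuously
differentiable with strictly positive second derivative. Then there exist positive
constants $C_0, C_1, C_2, C_3$ such that, for all $m \in \mathbb{N}$, $f, f_1, \ldots, f_m \in \mathcal{F}$ and
$\lambda_1, \ldots, \lambda_m \ge 0$ with $\sum_i \lambda_i = 1$,

$$P_0 \log\frac{p_f}{p_{f^*}} \le -C_0 P_0 (f - f^*)^2, \tag{4.8}$$

$$P_0 \Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le C_1 P_0 (f - f^*)^2,$$

$$\sup_{0 < \alpha < 1} -\log P_0 \Bigl(\frac{\sum_i \lambda_i p_{f_i}}{p_{f^*}}\Bigr)^{\alpha} \ge C_2 \sum_i \lambda_i \bigl(P_0 (f_i - f^*)^2 - C_3 P_0 (f - f_i)^2\bigr).$$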

Proof. Suppose first that $f_0 \in \mathcal{F}$, so that $f^* = f_0$. As $\Phi$ is monotone
on $(0, \infty)$ and $(-\infty, 0)$, inequality (4.7) is automatically also satisfied for $\nu$
in a given compactum (with $C_0$ depending on the compactum). Choosing
the compactum large enough such that $(f - f^*)(X)$ is contained in it with
probability one, we conclude that (4.8) holds (with $f_0 = f^*$).

If $f_0$ is not contained in $\mathcal{F}$ but $\mathcal{F}$ is convex, we obtain a similar inequality
with $f^*$ replacing $f_0$, as follows. Because $f^*$ minimizes $f \mapsto P_0 \Phi(f - f_0)$ over
$\mathcal{F}$ and $f_t = (1 - t) f^* + t f \in \mathcal{F}$ for $t \in [0, 1]$, the right derivative of the map
$t \mapsto P_0 \Phi(f_t - f_0)$ is nonnegative at $t = 0$. This yields
$P_0 \Phi'(f^* - f_0)(f - f^*) \ge 0$. By a Taylor expansion,

$$P_0 \log\frac{p_{f^*}}{p_f} = P_0 \bigl(\Phi(f - f_0) - \Phi(f^* - f_0)\bigr) = P_0 \Phi'(f^* - f_0)(f - f^*) + \frac{1}{2} P_0 \Phi''(\tilde{f} - f_0)(f - f^*)^2,$$

for some $\tilde{f}$ between $f$ and $f^*$. The first term on the right-hand side is
nonnegative and the function $\Phi''$ is bounded away from zero on compacta
by assumption. Thus, the right-hand side is bounded below by a constant
times $P_0 (f - f^*)^2$ and again (4.8) follows.

Because $\log(p_f/p_{f^*})$ is bounded in absolute value by $|f - f^*|$, we also
have, with $M$ a uniform upper bound on $\mathcal{F}$ and $f_0$,

$$P_0 \Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \le P_0 (f - f^*)^2, \qquad P_0 \Bigl(\log\frac{p_f}{p_{f^*}}\Bigr)^2 \Bigl(\frac{p_f}{p_{f^*}}\Bigr)^2 \le e^{4M} P_0 (f - f^*)^2.$$

As in the proof of Lemma 4.1, we can combine these inequalities, (4.8) and
Lemma 4.3 to obtain the result. □

As in the case of regression using the normal density for the error
distribution, the preceding lemma reduces the entropy calculations for the
application of Theorem 2.1 to estimates of the $L_2(P_0)$-entropy of the class of
regression functions $\mathcal{F}$. The resulting rate of convergence is the same as in
the case where a normal distribution is used for the error. A difference with
the normal case is that presently no tail conditions of the type $\mathrm{E}_0 e^{M|e_0|} < \infty$
are necessary. Instead the lemma assumes a certain smoothness of the true
distribution of the error $e_0$.

5. Parametric models. The behavior of posterior distributions for finite- 
dimensional, misspecified models was considered in [1] and more recently 
by Bunke and Milhaud [3] (see also the references in the latter). In this 
section we show that the basic result that the posterior concentrates near 
a minimal Kullback-Leibler point at the rate $\sqrt{n}$ follows from our general
theorems under some natural conditions. We first consider models indexed 
by a parameter in a general metric space and relate the rate of convergence 
to the metric entropy of the parameter set. Next we specialize to Euclidean 
parameter sets. 

Let $\{p_\theta : \theta \in \Theta\}$ be a collection of probability densities indexed by a
parameter $\theta$ in a metric space $(\Theta, d)$. Let $P_0$ be the true distribution of the
data and assume that there exists a $\theta^* \in \Theta$, such that, for all $\theta, \theta_1, \theta_2 \in \Theta$
and some constant $C > 0$,

$$P_0 \log\frac{p_\theta}{p_{\theta^*}} \le -C d^2(\theta, \theta^*), \tag{5.1}$$

$$P_0 \Bigl(\sqrt{\frac{p_{\theta_1}}{p_{\theta_2}}} - 1\Bigr)^2 \le d^2(\theta_1, \theta_2), \tag{5.2}$$

$$P_0 \Bigl(\log\frac{p_{\theta_1}}{p_{\theta_2}}\Bigr)^2 \le d^2(\theta_1, \theta_2). \tag{5.3}$$

The first inequality implies that $\theta^*$ is a point of minimal Kullback-Leibler
divergence $\theta \mapsto -P_0 \log(p_\theta/p_0)$ between $P_0$ and the model. The second and
third conditions are (integrated) Lipschitz conditions on the dependence of
$p_\theta$ on $\theta$. The following lemma shows that in the application of Theorems 2.1
and 2.2 these conditions allow one to replace the entropy for testing by the
local entropy of $\Theta$ relative to (a multiple of) the natural metric $d$.

In examples it may be worthwhile to relax the conditions somewhat. In
particular, the conditions (5.2)-(5.3) can be "localized." Rather than assuming
that they are valid for every $\theta_1, \theta_2 \in \Theta$, the same results can be obtained
if they are valid for every pair $(\theta_1, \theta_2)$ with $d(\theta_1, \theta_2)$ sufficiently small and
every pair $(\theta_1, \theta_2)$ with arbitrary $\theta_1$ and $\theta_2 = \theta^*$. For $\theta_2 = \theta^*$ and $P_0 = P_{\theta^*}$
(i.e., the well-specified situation), condition (5.2) is a bound on the Hellinger
distance between $P_{\theta^*}$ and $P_{\theta_1}$.



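Lemma 5.1. Under the preceding conditions, there exist positive constants
$C_1, C_2$ such that, for all $m \in \mathbb{N}$, $\theta, \theta_1, \ldots, \theta_m \in \Theta$ and $\lambda_1, \ldots, \lambda_m \ge 0$
with $\sum_i \lambda_i = 1$,

$$\sum_i \lambda_i d^2(\theta_i, \theta^*) - C_1 \sum_i \lambda_i d^2(\theta_i, \theta) \le C_2 \sup_{0 < \alpha < 1} -\log P_0 \Bigl(\frac{\sum_i \lambda_i p_{\theta_i}}{p_{\theta^*}}\Bigr)^{\alpha}.$$

Proof. In view of Lemma 5.3 (below) with $p = p_{\theta^*}$, (5.2) and (5.3),
there exists a constant $C$ such that

$$\Bigl|1 - P_0 \Bigl(\frac{\sum_i \lambda_i p_{\theta_i}}{p_{\theta^*}}\Bigr)^{\alpha} - \alpha P_0 \log\frac{p_{\theta^*}}{\sum_i \lambda_i p_{\theta_i}}\Bigr| \le 2\alpha^2 C \sum_i \lambda_i d^2(\theta_i, \theta^*). \tag{5.4}$$

By Lemma 5.3 with $\alpha = 1$, $p = p_\theta$, (5.2) and (5.3),

$$\Bigl|1 - P_0 \Bigl(\frac{\sum_i \lambda_i p_{\theta_i}}{p_\theta}\Bigr) - P_0 \log\frac{p_\theta}{\sum_i \lambda_i p_{\theta_i}}\Bigr| \le 2C \sum_i \lambda_i d^2(\theta_i, \theta).$$

We can evaluate this with $\lambda_i = 1$ (for each $i$ in turn) and next subtract the
convex combination of the resulting inequalities from the preceding display
to obtain

$$\Bigl|P_0 \log\frac{p_\theta}{\sum_i \lambda_i p_{\theta_i}} - \sum_i \lambda_i P_0 \log\frac{p_\theta}{p_{\theta_i}}\Bigr| \le 4C \sum_i \lambda_i d^2(\theta_i, \theta).$$

By the additivity of the logarithm, this remains valid if we replace $\theta$ on the
left-hand side by $\theta^*$. Combining the resulting inequality with (5.1) and (5.4),
we obtain

$$1 - P_0 \Bigl(\frac{\sum_i \lambda_i p_{\theta_i}}{p_{\theta^*}}\Bigr)^{\alpha} \ge \alpha \sum_i \lambda_i d^2(\theta_i, \theta^*)(C - 2\alpha C) - 4C \sum_i \lambda_i d^2(\theta_i, \theta).$$

The lemma follows upon choosing $\alpha > 0$ sufficiently small and using
$-\log x \ge 1 - x$. □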

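If the prior on the model $\{p_\theta : \theta \in \Theta\}$ is induced by a prior on the
parameter set $\Theta$, then the prior mass condition (2.11) translates into a lower bound
for the prior mass of the set

$$B(\varepsilon, \theta^*; P_0) = \Bigl\{\theta \in \Theta : -P_0 \log\frac{p_\theta}{p_{\theta^*}} \le \varepsilon^2,\ P_0 \Bigl(\log\frac{p_\theta}{p_{\theta^*}}\Bigr)^2 \le \varepsilon^2\Bigr\}.$$

In addition to (5.1), it is reasonable to assume a lower bound of the form

$$P_0 \log\frac{p_\theta}{p_{\theta^*}} \ge -C d^2(\theta, \theta^*), \tag{5.5}$$

at least for small values of $d(\theta, \theta^*)$. This together with (5.3) implies that
$B(\varepsilon, \theta^*; P_0)$ contains a ball of the form $\{\theta : d(\theta, \theta^*) \le C_1 \varepsilon\}$ for small enough $\varepsilon$.
Thus, in the verification of (2.4) or (2.11) we may replace $B(\varepsilon, P^*; P_0)$ by a
ball of radius $\varepsilon$ around $\theta^*$. These observations lead to the following theorem.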



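Theorem 5.1. Let (5.1)-(5.5) hold. If for sufficiently small $A$ and $C$,

$$\sup_{\varepsilon > \varepsilon_n} \log N\bigl(A\varepsilon, \{\theta \in \Theta : \varepsilon < d(\theta, \theta^*) < 2\varepsilon\}, d\bigr) \le n\varepsilon_n^2,$$

$$\frac{\Pi(\theta : j\varepsilon_n < d(\theta, \theta^*) < 2j\varepsilon_n)}{\Pi(\theta : d(\theta, \theta^*) \le C\varepsilon_n)} \le e^{n\varepsilon_n^2 j^2/8},$$

then $\Pi(\theta : d(\theta, \theta^*) \ge M_n \varepsilon_n \mid X_1, \ldots, X_n) \to 0$ in $L_1(P_0^n)$ for any $M_n \to \infty$.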

5.1. Finite-dimensional models. Let $\Theta$ be an open subset of $m$-dimensional
Euclidean space equipped with the Euclidean distance $d$ and let $\{p_\theta : \theta \in \Theta\}$
be a model satisfying (5.1)-(5.5).

Then the local covering numbers as in the preceding theorem satisfy, for
some constant $B$,

$$N\bigl(A\varepsilon, \{\theta \in \Theta : \varepsilon < d(\theta, \theta^*) < 2\varepsilon\}, d\bigr) \le \Bigl(\frac{B}{A}\Bigr)^m$$

(see, e.g., [8], Section 5). In view of Lemma 2.2, condition (2.5) is satisfied
for $\varepsilon_n$ a large multiple of $1/\sqrt{n}$. If the prior $\Pi$ on $\Theta$ possesses a density that
is bounded away from zero and infinity, then

$$\frac{\Pi(\theta : d(\theta, \theta^*) \le j\varepsilon)}{\Pi(B(\varepsilon, \theta^*; P_0))} \le C_2 j^m,$$

for some constant $C_2$. It follows that (2.11) is satisfied for the same $\varepsilon_n$.
Hence, the posterior concentrates at rate $1/\sqrt{n}$ near the point $\theta^*$.

The preceding situation arises if the minimal point $\theta^*$ is interior to the
parameter set $\Theta$. An example is fitting an exponential family, such as the
Gaussian model, to observations that are not sampled from an element of the
family. If the minimal point $\theta^*$ is not interior to $\Theta$, then we cannot expect
(5.1) to hold for the natural distance and different rates of convergence may
arise. We include a simple example of the latter type, which is somewhat
surprising.

Example. Suppose that $P_0$ is the standard normal distribution and
the model consists of all $N(\theta, 1)$-distributions with $\theta \ge 1$. The minimal
Kullback-Leibler point is $\theta^* = 1$. If the prior possesses a density on $[1, \infty)$
that is bounded away from 0 and infinity near 1, then the posterior
concentrates near $\theta^*$ at the rate $1/n$.

One easily shows that

$$-P_0 \log\frac{p_\theta}{p_{\theta^*}} = \frac{1}{2}(\theta - \theta^*)(\theta + \theta^*),$$

$$-\log P_0 \Bigl(\frac{p_\theta}{p_{\theta^*}}\Bigr)^{\alpha} = \frac{1}{2}\alpha(\theta - \theta^*)\bigl(\theta + \theta^* - \alpha(\theta - \theta^*)\bigr). \tag{5.6}$$

This shows that (2.9) is satisfied for a multiple of the metric
$d(p_{\theta_1}, p_{\theta_2}) = \sqrt{|\theta_1 - \theta_2|}$ on $\Theta = [1, \infty)$. Its strengthening (2.8) can be verified by the same
methods as before, or, alternatively, the existence of suitable tests can be
established directly based on the special nature of the normal location family.
[A suitable test for an interval $(\theta_1, \theta_2)$ can be obtained from a suitable test
for its left end point.] The entropy and prior mass can be estimated as in
regular parametric models and conditions (2.5)-(2.11) can be shown to be
satisfied for $\varepsilon_n$ a large multiple of $1/\sqrt{n}$. This yields the rate $1/\sqrt{n}$ relative
to the metric $\sqrt{|\theta_1 - \theta_2|}$ and, hence, the rate $1/n$ in the natural metric.
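
The second identity in (5.6) can be checked by direct numerical integration. The sketch below is a minimal illustration using the midpoint rule; the integration range, step count, parameter values and function names are arbitrary choices made for this check.

```python
import math


def phi(x):
    """Standard normal density (the density of P0 in the example)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def neg_log_hellinger_transform(theta, alpha, theta_star=1.0,
                                half_width=12.0, steps=24000):
    """-log P0 (p_theta / p_theta*)^alpha for P0 = N(0, 1), by the midpoint rule."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        log_ratio = -0.5 * (x - theta) ** 2 + 0.5 * (x - theta_star) ** 2
        total += phi(x) * math.exp(alpha * log_ratio) * h
    return -math.log(total)


def closed_form(theta, alpha, theta_star=1.0):
    """Right-hand side of the second line of (5.6)."""
    gap = theta - theta_star
    return 0.5 * alpha * gap * (theta + theta_star - alpha * gap)


for theta, alpha in [(1.5, 0.3), (2.5, 0.5), (4.0, 0.9)]:
    print(theta, alpha, neg_log_hellinger_transform(theta, alpha), closed_form(theta, alpha))
```

Since the closed form is linear in $\theta - \theta^*$ to first order, the natural scale of separation is $|\theta - \theta^*|$ rather than its square, which is why the metric $\sqrt{|\theta_1 - \theta_2|}$ appears.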

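Theorem 2.2 only gives an upper bound on the rate of convergence. In
the present situation this appears to be sharp. For instance, for a uniform
prior on $[1, 2]$, the posterior mass of the interval $[c, 2]$ can be seen to be, with
$Z_n = \sqrt{n}\, \bar{X}_n$,

$$\frac{\Phi(2\sqrt{n} - Z_n) - \Phi(c\sqrt{n} - Z_n)}{\Phi(2\sqrt{n} - Z_n) - \Phi(\sqrt{n} - Z_n)} \approx \frac{\sqrt{n} - Z_n}{c\sqrt{n} - Z_n}\, e^{-(1/2)(c^2 - 1)n + Z_n (c - 1)\sqrt{n}},$$

where $\Phi$ denotes the standard normal distribution function and we use Mills'
ratio to see that $\Phi(y_n) - \Phi(x_n) \approx (1/x_n)\phi(x_n)$ if $x_n, y_n \to \infty$ such that
$x_n/y_n \to c \in (0, 1)$. This is bounded away from zero for $c = c_n = 1 + C/n$ and fixed $C$.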
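
The sharpness claim can be illustrated by evaluating the exact posterior mass. The sketch below assumes the setting of the example (uniform prior on $[1, 2]$, so that the posterior is a normal distribution with mean $\bar{X}_n$ and variance $1/n$ truncated to $[1, 2]$) and fixes $Z_n = 0$ for simplicity; this value, the choice $C = 1$ and the function names are hypothetical. Upper tails are computed with `math.erfc` to avoid cancellation.

```python
import math


def upper_tail(x):
    """1 - Phi(x) for the standard normal distribution function Phi."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def posterior_mass(c, n, z=0.0):
    """Posterior mass of [c, 2] under a uniform prior on [1, 2].

    z plays the role of Z_n = sqrt(n) * (sample mean).
    """
    root = math.sqrt(n)
    numerator = upper_tail(c * root - z) - upper_tail(2.0 * root - z)
    denominator = upper_tail(root - z) - upper_tail(2.0 * root - z)
    return numerator / denominator


for n in (100, 400, 1000):
    at_rate_1_over_n = posterior_mass(1.0 + 1.0 / n, n)
    at_rate_1_over_sqrt_n = posterior_mass(1.0 + 1.0 / math.sqrt(n), n)
    print(n, at_rate_1_over_n, at_rate_1_over_sqrt_n)
```

In this sketch the mass beyond $1 + C/n$ stays roughly constant in $n$, while the mass beyond $1 + C/\sqrt{n}$ vanishes quickly, in agreement with a contraction rate of order $1/n$.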

Lemma 5.2. There exists a universal constant $C$ such that for any probability
measure $P_0$ and any finite measures $P$ and $Q$ and any $0 < \alpha \le 1$,

$$
\Bigl| 1 - P_0\Bigl(\frac qp\Bigr)^{\alpha} - \alpha P_0 \log\frac pq \Bigr|
\le \alpha^2 C\, P_0\Bigl[ \Bigl(\sqrt{\frac qp} - 1\Bigr)^2 1_{\{q>p\}} + \log^2\frac pq\, 1_{\{q\le p\}} \Bigr].
$$

Lemma 5.3. There exists a universal constant $C$ such that, for any
probability measure $P_0$ and any finite measures $P, Q_1, \ldots, Q_m$ and any
$\lambda_1, \ldots, \lambda_m \ge 0$ with $\sum_i \lambda_i = 1$ and $0 < \alpha \le 1$, the following inequality holds:



Proofs. The function $R$ defined by $R(x) = (e^x - 1 - x)/\alpha^2(e^{x/2\alpha} - 1)^2$
for $x > 0$ and $R(x) = (e^x - 1 - x)/x^2$ for $x < 0$ is uniformly bounded on $\mathbb{R}$ by
a constant $C$, independent of $\alpha \in (0,1]$. [This may be proved by noting that
the functions $(e^x - 1)/\alpha(e^{x/\alpha} - 1)$ and $(e^x - 1 - x)/(e^{x/2} - 1)^2$ are bounded,
where this follows for the first by developing the exponentials in their power
series.] For the proof of the first lemma, we can proceed as in the proof
of Lemma 4.2. For the proof of the second lemma, we proceed as in the
proof of Lemma 4.3, this time also making use of the convexity of the map
$x \mapsto |\sqrt{x} - 1|^2$ on $[0,\infty)$. □

6. Existence of tests. The proofs of Theorems 2.1 and 2.2 rely on tests
of $P_0$ versus the positive, finite measures $Q(P)$ obtained from points $P$ that
are at positive distance from the set of points with minimal Kullback-Leibler
divergence. Because we need to test $P_0$ against finite measures (i.e., not
necessarily probability measures), known results on tests using the Hellinger
distance, such as in [10] or [8], do not apply. It turns out that in this situation
the Hellinger distance may not be appropriate and instead we use the full
Hellinger transform. The aim of this section is to prove the existence of
suitable tests and give upper bounds on their power. We first formulate
the results in a general notation and then specialize to the application in
misspecified models.

6.1. General setup. Let $P$ be a probability measure on a measurable
space $(\mathcal{X},\mathcal{A})$ (playing the role of $P_0$) and let $\mathcal{Q}$ be a class of finite measures
on $(\mathcal{X},\mathcal{A})$ [playing the role of the measures $Q$ with $dQ = (p_0/p^*)\,dP$]. We
wish to bound the minimax risk for testing $P$ versus $\mathcal{Q}$, defined by

$$
\pi(P,\mathcal{Q}) = \inf_{\phi} \sup_{Q\in\mathcal{Q}} \bigl(P\phi + Q(1-\phi)\bigr),
$$

where the infimum is taken over all measurable functions $\phi: \mathcal{X} \to [0,1]$. Let
$\mathrm{conv}(\mathcal{Q})$ denote the convex hull of the set $\mathcal{Q}$.

Lemma 6.1. If there exists a $\sigma$-finite measure that dominates all $Q \in \mathcal{Q}$,
then

$$
\pi(P,\mathcal{Q}) = \sup_{Q\in\mathrm{conv}(\mathcal{Q})} \bigl(P(p<q) + Q(p\ge q)\bigr).
$$

Moreover, there exists a test $\phi$ that attains the infimum in the definition of
$\pi(P,\mathcal{Q})$.

Proof. If $\mu'$ is a measure dominating $\mathcal{Q}$, then a $\sigma$-finite measure $\mu$
exists that dominates both $\mathcal{Q}$ and $P$ (e.g., $\mu = \mu' + P$). Let $p$ and $q$ be
$\mu$-densities of $P$ and $Q$, for every $Q \in \mathcal{Q}$. The set of test-functions $\phi$ can
be identified with the positive unit ball $\Phi$ of $L_\infty(\mathcal{X},\mathcal{A},\mu)$, which is dual
to $L_1(\mathcal{X},\mathcal{A},\mu)$, since $\mu$ is $\sigma$-finite. If equipped with the weak-* topology,
the positive unit ball $\Phi$ is Hausdorff and compact by the Banach-Alaoglu
theorem (see, e.g., [11], Theorem 2.6.18, and note that the positive functions
form a closed and convex subset of the unit ball). The convex hull
$\mathrm{conv}(\mathcal{Q})$ (or rather the corresponding set of $\mu$-densities) is a convex subset
of $L_1(\mathcal{X},\mathcal{A},\mu)$. The map

$$
L_\infty(\mathcal{X},\mathcal{A},\mu) \times L_1(\mathcal{X},\mathcal{A},\mu) \to \mathbb{R},
\qquad (\phi,Q) \mapsto \phi P + (1-\phi)Q,
$$

is concave in $Q$ and convex in $\phi$. [Note that in the current context we write
$\phi P$ instead of $P\phi$, in accordance with the fact that we consider $\phi$ as a
bounded linear functional on $L_1(\mathcal{X},\mathcal{A},\mu)$.] Moreover, the map is weak-*-continuous
in $\phi$ for every fixed $Q$ [note that every weak-*-converging net
$\phi_a \to \phi$ by definition satisfies $\phi_a Q \to \phi Q$ for all $Q \in L_1(\mathcal{X},\mathcal{A},\mu)$]. The
conditions for application of the minimax theorem (see, e.g., [15], page 239)
are satisfied and we conclude

$$
\inf_{\phi\in\Phi} \sup_{Q\in\mathrm{conv}(\mathcal{Q})} \bigl(\phi P + (1-\phi)Q\bigr)
= \sup_{Q\in\mathrm{conv}(\mathcal{Q})} \inf_{\phi\in\Phi} \bigl(\phi P + (1-\phi)Q\bigr).
$$

The expression on the left-hand side is the minimax testing risk $\pi(P,\mathcal{Q})$.
The infimum on the right-hand side is attained at the point $\phi = 1\{p<q\}$,
which leads to the first assertion of the lemma upon substitution.

The second assertion of the lemma follows because the function $\phi \mapsto
\sup\{\phi P + (1-\phi)Q : Q \in \mathrm{conv}(\mathcal{Q})\}$ is a supremum of weak-*-continuous functions
and, hence, attains its minimum on the compactum $\Phi$. □

It is possible to express the right-hand side of the preceding lemma in
the $L_1$-distance between $P$ and $Q$, but this is not useful for the following.
Instead, we use a bound in terms of the Hellinger transform $\rho_\alpha(P,Q)$ defined
by, for $0 < \alpha < 1$,

$$
\rho_\alpha(P,Q) = \int p^{\alpha} q^{1-\alpha}\,d\mu.
$$

By Hölder's inequality, this quantity is finite for all finite measures $P$ and
$Q$. The definition is independent of the choice of dominating measure $\mu$.
For any pair $(P,Q)$ and every $\alpha \in (0,1)$, we can bound

$$
\begin{aligned}
P(p<q) + Q(p\ge q) &= \int_{p<q} p\,d\mu + \int_{p\ge q} q\,d\mu\\
&\le \int_{p<q} p^{\alpha} q^{1-\alpha}\,d\mu + \int_{p\ge q} p^{\alpha} q^{1-\alpha}\,d\mu\\
&= \rho_\alpha(P,Q).
\end{aligned}
\tag{6.1}
$$

Hence, the right-hand side of the preceding lemma is bounded by $\sup_Q \rho_\alpha(P,Q)$
for all $\alpha \in (0,1)$. The advantage of this bound is the fact that it factorizes
if $P$ and $Q$ are product measures. For ease of notation, define

$$
\rho_\alpha(\mathcal{P},\mathcal{Q}) = \sup\{\rho_\alpha(P,Q) : P \in \mathrm{conv}(\mathcal{P}),\ Q \in \mathrm{conv}(\mathcal{Q})\}.
$$



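Lemma 6.2. For any $0 < \alpha < 1$ and classes $\mathcal{P}_1, \mathcal{P}_2, \mathcal{Q}_1, \mathcal{Q}_2$ of finite
measures,

$$
\rho_\alpha(\mathcal{P}_1 \times \mathcal{P}_2, \mathcal{Q}_1 \times \mathcal{Q}_2) \le \rho_\alpha(\mathcal{P}_1,\mathcal{Q}_1)\,\rho_\alpha(\mathcal{P}_2,\mathcal{Q}_2),
$$

where $\mathcal{P}_1 \times \mathcal{P}_2$ denotes the class of product measures $\{P_1 \times P_2 : P_1 \in \mathcal{P}_1,
P_2 \in \mathcal{P}_2\}$.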

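Proof. Let $P \in \mathrm{conv}(\mathcal{P}_1 \times \mathcal{P}_2)$ and $Q \in \mathrm{conv}(\mathcal{Q}_1 \times \mathcal{Q}_2)$ be given. Since
both are (finite) convex combinations, $\sigma$-finite measures $\mu_1$ and $\mu_2$ can always
be found such that both $P$ and $Q$ have $\mu_1 \times \mu_2$ densities which both
can be written in the form of a finite convex combination as

$$
\begin{aligned}
p(x,y) &= \sum_i \lambda_i\, p_{1i}(x)\, p_{2i}(y), \qquad \lambda_i \ge 0,\quad \sum_i \lambda_i = 1,\\
q(x,y) &= \sum_j \kappa_j\, q_{1j}(x)\, q_{2j}(y), \qquad \kappa_j \ge 0,\quad \sum_j \kappa_j = 1,
\end{aligned}
$$

for $\mu_1 \times \mu_2$-almost-all pairs $(x,y) \in \mathcal{X} \times \mathcal{X}$. Here $p_{1i}$ and $q_{1j}$ are $\mu_1$-densities
for measures belonging to $\mathcal{P}_1$ and $\mathcal{Q}_1$, respectively (and, analogously, $p_{2i}$
and $q_{2j}$ are $\mu_2$-densities for measures in $\mathcal{P}_2$ and $\mathcal{Q}_2$). This implies that we
can write

$$
\begin{aligned}
\rho_\alpha(P,Q) &= \int\!\!\int p^{\alpha} q^{1-\alpha}\,d(\mu_1 \times \mu_2)\\
&= \int \biggl[ \int \Bigl( \frac{\sum_i \lambda_i p_{1i}(x) p_{2i}(y)}{\sum_i \lambda_i p_{1i}(x)} \Bigr)^{\alpha}
\Bigl( \frac{\sum_j \kappa_j q_{1j}(x) q_{2j}(y)}{\sum_j \kappa_j q_{1j}(x)} \Bigr)^{1-\alpha} d\mu_2(y) \biggr]\\
&\qquad \times \Bigl( \sum_i \lambda_i p_{1i}(x) \Bigr)^{\alpha} \Bigl( \sum_j \kappa_j q_{1j}(x) \Bigr)^{1-\alpha} d\mu_1(x)
\end{aligned}
$$

(where, as usual, the integrand of the inner integral is taken equal to zero
whenever the $\mu_1$-density equals zero). The inner integral is bounded by
$\rho_\alpha(\mathcal{P}_2,\mathcal{Q}_2)$ for every fixed $x \in \mathcal{X}$. After substituting this upper bound,
the remaining integral is bounded by $\rho_\alpha(\mathcal{P}_1,\mathcal{Q}_1)$. □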

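Combining (6.1) with Lemmas 6.2 and 6.1, we obtain the following theorem.

Theorem 6.1. If $P$ is a probability measure on $(\mathcal{X},\mathcal{A})$ and $\mathcal{Q}$ is a
dominated set of finite measures on $(\mathcal{X},\mathcal{A})$, then, for every $n \ge 1$, there
exists a test $\phi_n : \mathcal{X}^n \to [0,1]$ such that, for all $0 < \alpha < 1$,

$$
\sup_{Q\in\mathcal{Q}} \bigl(P^n\phi_n + Q^n(1-\phi_n)\bigr) \le \rho_\alpha(P,\mathcal{Q})^n.
$$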

The bound given by the theorem is useful only if $\rho_\alpha(P,\mathcal{Q}) < 1$. For probability
measures $P$ and $Q$, we have

$$
\rho_{1/2}(P,Q) = 1 - \tfrac12 \int (\sqrt{p} - \sqrt{q})^2\,d\mu
$$

and, hence, we might use the bound with $\alpha = 1/2$ if the Hellinger distance
of $\mathrm{conv}(\mathcal{Q})$ to $P$ is positive. For a general finite measure $Q$, the quantity
$\rho_{1/2}(P,Q)$ may be bigger than 1 and, depending on $Q$, the Hellinger
transform $\rho_\alpha(P,Q)$ may even lie above 1 for every $\alpha$. The following lemma
shows that this is controlled by the (generalized) Kullback-Leibler divergence
$-P\log(q/p)$.

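Lemma 6.3. For a probability measure $P$ and a finite measure $Q$, the
function $\alpha \mapsto \rho_\alpha(Q,P)$ is convex on $[0,1]$ with $\rho_\alpha(Q,P) \to P(q>0)$ as $\alpha \downarrow 0$,
$\rho_\alpha(Q,P) \to Q(p>0)$ as $\alpha \uparrow 1$ and

$$
\frac{d\rho_\alpha(Q,P)}{d\alpha}\bigg|_{\alpha=0} = P \log\Bigl(\frac qp\Bigr) 1_{\{q>0\}}
$$

(which may be equal to $-\infty$).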

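Proof. The function $\alpha \mapsto e^{\alpha y}$ is convex on $(0,1)$ for all $y \in [-\infty,\infty)$,
implying the convexity of $\alpha \mapsto \rho_\alpha(Q,P) = P(q/p)^{\alpha}$ on $(0,1)$. The function
$\alpha \mapsto y^{\alpha} = e^{\alpha\log y}$ is continuous on $[0,1]$ for any $y > 0$, is decreasing for $y < 1$,
increasing for $y > 1$ and constant for $y = 1$. By monotone convergence, as
$\alpha \downarrow 0$,

$$
Q\Bigl(\frac pq\Bigr)^{\alpha} 1_{\{0<p<q\}} \uparrow Q(0<p<q).
$$

By the dominated convergence theorem, with dominating function $(p/q)^{\alpha} 1_{\{p\ge q\}} \le (p/q)^{1/2} 1_{\{p\ge q\}}$
for $\alpha \le 1/2$, we have, as $\alpha \to 0$,

$$
Q\Bigl(\frac pq\Bigr)^{\alpha} 1_{\{p\ge q\}} \to Q(p\ge q).
$$

Combining the two preceding displays, we see that $\rho_{1-\alpha}(Q,P) = Q(p/q)^{\alpha} \to Q(p>0)$
as $\alpha \downarrow 0$.

By the convexity of the function $\alpha \mapsto e^{\alpha y}$, the map $\alpha \mapsto f_\alpha(y) = (e^{\alpha y} - 1)/\alpha$
decreases, as $\alpha \downarrow 0$, to $(d/d\alpha)|_{\alpha=0}\, e^{\alpha y} = y$, for every $y$. For $y \le 0$, we
have $f_\alpha(y) \le 0$, while, for $y \ge 0$, by Taylor's formula,

$$
f_\alpha(y) \le \sup_{0<\alpha'\le\alpha} y e^{\alpha' y} \le y e^{\alpha y} \le \frac{1}{\varepsilon}\, e^{(\alpha+\varepsilon)y}.
$$

Hence, we conclude that $f_\alpha(y) \le 0 \vee \varepsilon^{-1} e^{(\alpha+\varepsilon)y} 1_{\{y\ge 0\}}$. Consequently, the quotient
$\alpha^{-1}(e^{\alpha\log(q/p)} - 1)$ decreases to $\log(q/p)$ as $\alpha \downarrow 0$ and is bounded above
by $0 \vee \varepsilon^{-1}(q/p)^{2\varepsilon} 1_{\{q\ge p\}}$ for small $\alpha > 0$, which is $P$-integrable for $2\varepsilon < 1$. We
conclude that

$$
\frac{1}{\alpha}\bigl(\rho_\alpha(Q,P) - \rho_0(Q,P)\bigr) = \frac{1}{\alpha} P\Bigl(\Bigl(\frac qp\Bigr)^{\alpha} - 1\Bigr) 1_{\{q>0\}} \downarrow P\log\Bigl(\frac qp\Bigr) 1_{\{q>0\}},
$$

as $\alpha \downarrow 0$, by the monotone convergence theorem. □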

Two typical graphs of the Hellinger transform $\alpha \mapsto \rho_\alpha(Q,P)$ are shown in
Figure 1 [corresponding to fitting a unit variance normal location model in
a situation where the observations are sampled from a $N(0,2)$-distribution].
For $P$ a probability measure with $P \ll Q$, the Hellinger transform is equal to
1 at $\alpha = 0$, but will eventually increase to a level that is above 1 near $\alpha = 1$
if $Q(p>0) > 1$. Unless the slope $P\log(q/p)$ is negative, it will never decrease
below the level 1. For probability measures $P$ and $Q$, this slope equals minus
the Kullback-Leibler distance and, hence, is strictly negative unless $P = Q$.
In that case, the graph is strictly below 1 on $(0,1)$ and $\rho_{1/2}(P,Q)$ is a
convenient choice to work with. For a general finite measure $Q$, the Hellinger
transform $\rho_\alpha(Q,P)$ is guaranteed to assume values strictly less than 1 near
$\alpha = 0$, provided that the Kullback-Leibler divergence $P\log(p/q)$ is positive,
which is not automatically the case. For testing a composite alternative $\mathcal{Q}$,
we shall need that this is the case uniformly in $Q \in \mathrm{conv}(\mathcal{Q})$. For a convex
alternative $\mathcal{Q}$, Theorem 6.1 guarantees the existence of tests based on $n$
observations with error probabilities bounded by $e^{-n\varepsilon^2}$ if

$$
\varepsilon^2 \le \sup_{0<\alpha<1} \inf_{Q\in\mathcal{Q}} \log\frac{1}{\rho_\alpha(Q,P)}.
$$

In some of the examples we can achieve inequalities of this type by bounding
the right-hand side below by a (uniform) Taylor expansion of $\alpha \mapsto
-\log\rho_\alpha(P,Q)$ in $\alpha$ near $\alpha = 0$. Such arguments are not mere technical
generalizations: they can be necessary already to prove posterior consistency
relative to misspecified standard parametric models.

Fig. 1. The Hellinger transforms $\alpha \mapsto \rho_\alpha(Q,P)$, for $P = N(0,2)$ and $Q$,
respectively, the measure defined by $dQ = (dN(3/2,1)/dN(0,1))\,dP$ (left) and
$dQ = (dN(3/2,1)/dN(1,1))\,dP$ (right). Intercepts with the vertical axis at the left and
right of the graphs equal $P(q>0)$ and $Q(p>0)$, respectively. The slope at 0 equals (minus)
the Kullback-Leibler divergence $P\log(p/q)$.
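For the normal location setting of Figure 1 the Hellinger transform has a closed form, which makes the behaviour described above easy to inspect. The sketch below is illustrative only and is not the authors' computation. It assumes that $N(0,2)$ denotes variance 2, it takes $dQ = (p_\theta/p_{\theta^*})\,dP_0$ for unit-variance normal densities $p_\theta$ and $p_{\theta^*}$, and the function names are introduced here. Under these assumptions $\rho_\alpha(Q,P_0) = P_0(p_\theta/p_{\theta^*})^{\alpha} = \exp\bigl(\tfrac12\alpha^2(\theta-\theta^*)^2\sigma_0^2 - \tfrac12\alpha(\theta^2-{\theta^*}^2)\bigr)$, and a midpoint Riemann sum is included as an independent check.

```python
import math


def hellinger_transform(alpha, theta, theta_star, var0):
    """Closed form of P0 (p_theta / p_theta_star)^alpha for P0 = N(0, var0)."""
    diff = theta - theta_star
    return math.exp(0.5 * alpha ** 2 * diff ** 2 * var0
                    - 0.5 * alpha * (theta ** 2 - theta_star ** 2))


def hellinger_transform_numeric(alpha, theta, theta_star, var0,
                                half_width=12.0, steps=20000):
    """The same quantity computed by a midpoint Riemann sum over x."""
    sd = math.sqrt(var0)
    lower = -half_width * sd
    step = 2.0 * half_width * sd / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * step
        log_ratio = x * (theta - theta_star) - 0.5 * (theta ** 2 - theta_star ** 2)
        density = math.exp(-0.5 * x * x / var0) / math.sqrt(2.0 * math.pi * var0)
        total += math.exp(alpha * log_ratio) * density * step
    return total


if __name__ == "__main__":
    # Configuration of the left panel: theta = 3/2, theta* = 0, variance 2.
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, hellinger_transform(alpha, 1.5, 0.0, 2.0))
```

In this configuration the exponent is $-1.125\,\alpha + 2.25\,\alpha^2$: the transform starts at 1 with negative slope, dips below 1 for small $\alpha$, returns to exactly 1 at $\alpha = 1/2$ and exceeds 1 at $\alpha = 1$. The choice $\alpha = 1/2$ would therefore give no useful test here, whereas small values of $\alpha$ do.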

If $P(q = 0) > 0$, then the Hellinger transform is strictly less than 1 at $\alpha = 0$
and, hence, good tests exist, even though it may be true that $\rho_{1/2}(P,Q) > 1$.
The existence of good tests is obvious in this case, since we can reject $Q$ if
the observations land in the set $\{q = 0\}$.

In the above we have assumed that $\mathcal{Q}$ is dominated. If this is not the case,
then the results go through, provided that we use Le Cam's generalized tests
[10], that is, we define

$$
\pi(P,\mathcal{Q}) = \inf_{\phi} \sup_{Q\in\mathcal{Q}} \bigl(\phi P + (1-\phi)Q\bigr),
$$

where the infimum is taken over the set of all continuous, positive linear maps
$\phi : L_1(\mathcal{X},\mathcal{A}) \to \mathbb{R}$ such that $\phi P \le 1$ for all probability measures $P$. This
collection of functionals includes the linear maps that arise from integration
of measurable functions $\phi : \mathcal{X} \to [0,1]$, but may be larger. Such tests would
be good enough for our purposes, but the generality appears to have little
additional value for our application to misspecified models.

The next step is to extend the upper bound to alternatives $\mathcal{Q}$ that are
possibly not convex. We are particularly interested in alternatives that are
complements of balls around $P$ in some metric. Let $L_1^+(\mathcal{X},\mathcal{A})$ be the set of
finite measures on $(\mathcal{X},\mathcal{A})$ and let $\tau : L_1^+(\mathcal{X},\mathcal{A}) \times L_1^+(\mathcal{X},\mathcal{A}) \to \mathbb{R}$ be such
that $\tau(P,\cdot) : \mathcal{Q} \to \mathbb{R}$ is a nonnegative function (written in a notation so as to
suggest a distance from $P$ to $Q$). For $Q \in \mathcal{Q}$, set

$$
\tau^2(P,Q) = \sup_{0<\alpha<1} \log\frac{1}{\rho_\alpha(P,Q)}.
\tag{6.2}
$$

For $\varepsilon > 0$, define $N_\tau(\varepsilon,\mathcal{Q})$ to be the minimal number of convex subsets of
$\{Q \in L_1^+(\mathcal{X},\mathcal{A}) : \tau(P,Q) > \varepsilon/2\}$ needed to cover $\{Q \in \mathcal{Q} : \varepsilon < \tau(P,Q) < 2\varepsilon\}$
and assume that $\mathcal{Q}$ is such that this number is finite for all $\varepsilon > 0$. (The
requirement that these convex subsets have $\tau$-distance $\varepsilon/2$ to $P$ is essential.)
Then the following theorem applies.

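Theorem 6.2. Let $P$ be a probability measure and $\mathcal{Q}$ be a dominated
set of finite measures on $(\mathcal{X},\mathcal{A})$. Then for all $\varepsilon > 0$ and all $n \ge 1$, there
exists a test $\phi_n$ such that, for all $J \in \mathbb{N}$,

$$
\begin{aligned}
P^n\phi_n &\le \sum_{j=1}^{\infty} N_\tau(j\varepsilon,\mathcal{Q})\, e^{-nj^2\varepsilon^2/4},\\
\sup_{\{Q : \tau(P,Q) > J\varepsilon\}} Q^n(1-\phi_n) &\le e^{-nJ^2\varepsilon^2/4}.
\end{aligned}
\tag{6.3}
$$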



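Proof. Fix $n \ge 1$ and $\varepsilon > 0$ and define $\mathcal{Q}_j = \{Q \in \mathcal{Q} : j\varepsilon < \tau(P,Q) \le
(j+1)\varepsilon\}$. By assumption, there exists for every $j \ge 1$ a finite cover of $\mathcal{Q}_j$
by $N_j = N_\tau(j\varepsilon,\mathcal{Q})$ convex sets $C_{j,1},\ldots,C_{j,N_j}$ of finite measures, with the
further property that

$$
\inf_{Q\in C_{j,i}} \tau^2(P,Q) \ge \frac{j^2\varepsilon^2}{4}, \qquad 1 \le i \le N_j.
\tag{6.4}
$$

According to Theorem 6.1, for all $n \ge 1$ and for each set $C_{j,i}$, there exists a
test $\phi_{n,j,i}$ such that, for all $\alpha \in (0,1)$, we have

$$
P^n\phi_{n,j,i} \le \rho_\alpha(P,C_{j,i})^n,
\qquad
\sup_{Q\in C_{j,i}} Q^n(1-\phi_{n,j,i}) \le \rho_\alpha(P,C_{j,i})^n.
$$

By (6.4), we have

$$
\sup_{Q\in C_{j,i}} \inf_{0<\alpha<1} \rho_\alpha(P,Q) = \sup_{Q\in C_{j,i}} e^{-\tau^2(P,Q)} \le e^{-j^2\varepsilon^2/4}.
$$

For fixed $P$ and $Q$, the function $\alpha \mapsto \rho_\alpha(P,Q)$ is convex and can be extended
continuously to a convex function on $[0,1]$. The function $Q \mapsto \rho_\alpha(P,Q)$ with
domain $L_1^+(\mathcal{X},\mathcal{A})$ is concave. By the minimax theorem (see, e.g., [15],
page 239), the left-hand side of the preceding display equals

$$
\inf_{0<\alpha<1} \sup_{Q\in C_{j,i}} \rho_\alpha(P,Q) = \inf_{0<\alpha<1} \rho_\alpha(P,C_{j,i}).
$$

It follows that

$$
P^n\phi_{n,j,i} \vee \sup_{Q\in C_{j,i}} Q^n(1-\phi_{n,j,i}) \le e^{-nj^2\varepsilon^2/4}.
$$

Now define a new test function by

$$
\phi_n = \sup_{j\ge 1} \max_{1\le i\le N_j} \phi_{n,j,i}.
$$

Then, for every $J \ge 1$,

$$
\begin{aligned}
P^n\phi_n &\le \sum_{j=1}^{\infty} \sum_{i=1}^{N_j} P^n\phi_{n,j,i} \le \sum_{j=1}^{\infty} N_\tau(j\varepsilon,\mathcal{Q})\, e^{-nj^2\varepsilon^2/4},\\
\sup_{Q\in\mathcal{Q}_{\ge J}} Q^n(1-\phi_n) &\le \sup_{j\ge J} \max_{i\le N_j} \sup_{Q\in C_{j,i}} Q^n(1-\phi_{n,j,i}) \le \sup_{j\ge J} e^{-nj^2\varepsilon^2/4} = e^{-nJ^2\varepsilon^2/4},
\end{aligned}
$$

where $\mathcal{Q}_{\ge J} = \{Q : \tau(P,Q) > J\varepsilon\} = \bigcup_{j\ge J} \mathcal{Q}_j$. □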

6.2. Application to misspecification. When applying the above in the
proof for consistency in misspecified models, the problem is to test the true
distribution $P_0$ against measures $Q = Q(P)$ taking the form $dQ = (p_0/p^*)\,dP$
for $P \in \mathcal{P}$. In this case, the Hellinger transform takes the form $\rho_\alpha(Q,P_0) =
P_0(p/p^*)^{\alpha}$ and its right derivative at $\alpha = 0$ is equal to $P_0\log(p/p^*)$. This is
negative for every $P \in \mathcal{P}$ if and only if $P^*$ is the point in $\mathcal{P}$ at minimal
Kullback-Leibler divergence to $P_0$. This observation illustrates that the measure
$P^*$ in Theorem 2.1 is necessarily a point of minimal Kullback-Leibler
divergence, even if this is not explicitly assumed. We formalize this in the
following lemma.

Lemma 6.4. If $P^*$ and $P$ are such that $P_0\log(p_0/p^*) < \infty$ and $P_0(p/p^*) < \infty$
and the right-hand side of (2.9) is positive, then $P_0\log(p_0/p^*) \le
P_0\log(p_0/p)$. Consequently, the covering numbers for testing $N_t(\varepsilon,\mathcal{P},d;P_0,P^*)$
in Theorem 2.1 can be finite only if $P^*$ is a point of minimal Kullback-Leibler
divergence relative to $P_0$.

Proof. The assumptions imply that $P_0(p^* > 0) = 1$. If $P_0(p = 0) > 0$,
then $P_0\log(p_0/p) = \infty$ and there is nothing to prove. Thus, we may assume
that $p$ is also strictly positive under $P_0$. Then, in view of Lemma 6.3, the
function $g$ defined by $g(\alpha) = P_0(p/p^*)^{\alpha} = \rho_\alpha(Q,P_0)$ is continuous on $[0,1]$
with $g(0) = P_0(p > 0) = 1$ and the right-hand side of (2.9) can be positive
only if $g(\alpha) < 1$ for some $\alpha \in [0,1]$. By convexity of $g$ and the fact that $g(0) =
1$, this can happen only if the right derivative of $g$ at zero is nonpositive. In
view of Lemma 6.3, this derivative is $g'(0+) = P_0\log(p/p^*)$.

Finiteness of the covering numbers for testing for some $\varepsilon > 0$ implies that
the right-hand side of (2.9) is positive for every $P \in \mathcal{P}$ with $d(P,P^*) > 0$, as
every such $P$ must be contained in one of the sets $B_i$ in the definition of
$N_t(\varepsilon,\mathcal{P},d;P_0,P^*)$ for some $\varepsilon > 0$, in which case the right-hand side of (2.9)
is bounded below by $\varepsilon^2/4$.

If $P_0(p/p^*) \le 1$ for every $P \in \mathcal{P}$, then the measure $Q$ defined by $dQ =
(p_0/p^*)\,dP$ is a subprobability measure and, hence, by convexity, the Hellinger
transform $\alpha \mapsto \rho_\alpha(P_0,Q)$ is never above the level 1 and is strictly less than
1 at $\alpha = 1/2$ unless $P_0 = Q$. In such a case there appears to be no loss in
generality to work with the choice $\alpha = 1/2$ only, leading to the distance $d$ as
in Lemma 2.3. This lemma shows that this situation arises if $\mathcal{P}$ is convex.
□



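The following theorem translates Theorem 6.2 into the form needed for
the proof of our main results. Recall the definition of the covering numbers
for testing $N_t(\varepsilon,\mathcal{P},d;P_0,P^*)$ in (2.2).

Theorem 6.3. Suppose $P^* \in \mathcal{P}$ and $P_0(p/p^*) < \infty$ for all $P \in \mathcal{P}$. Assume
that there exists a nonincreasing function $D$ such that, for some $\varepsilon_n \ge 0$
and every $\varepsilon > \varepsilon_n$,

$$
N_t(\varepsilon,\mathcal{P},d;P_0,P^*) \le D(\varepsilon).
\tag{6.5}
$$

Then for every $\varepsilon > \varepsilon_n$ there exists a test $\phi_n$ (depending on $\varepsilon > 0$) such that,
for every $J \in \mathbb{N}$,

$$
\begin{aligned}
P_0^n\phi_n &\le D(\varepsilon)\, \frac{e^{-n\varepsilon^2/4}}{1 - e^{-n\varepsilon^2/4}},\\
\sup_{\{P\in\mathcal{P} : d(P,P^*) > J\varepsilon\}} Q(P)^n(1-\phi_n) &\le e^{-nJ^2\varepsilon^2/4}.
\end{aligned}
\tag{6.6}
$$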

Proof. Define $\mathcal{Q}$ as the set of all finite measures $Q(P)$ as $P$ ranges
over $\mathcal{P}$ (where $p_0/p^* = 0$ if $p_0 = 0$) and define $\tau(Q_1,Q_2) = d(P_1,P_2)$. Then
$Q(P^*) = P_0$ and, hence, $d(P,P^*) = \tau(Q(P),P_0)$. Identify $P$ of Theorem 6.2
with the present measure $P_0$. By the definitions (2.2) and (6.2), we have
$N_\tau(\varepsilon,\mathcal{Q}) \le N_t(\varepsilon,\mathcal{P},d;P_0,P^*) \le D(\varepsilon)$ for every $\varepsilon > \varepsilon_n$. Therefore, the test
function guaranteed to exist by Theorem 6.2 satisfies

$$
P_0^n\phi_n \le \sum_{j=1}^{\infty} D(j\varepsilon)\, e^{-nj^2\varepsilon^2/4} \le D(\varepsilon) \sum_{j=1}^{\infty} e^{-nj^2\varepsilon^2/4},
$$

because $D$ is nonincreasing. This can be bounded further (as in the assertion)
since, for all $0 < x < 1$, $\sum_{n\ge 1} x^n = x/(1-x)$. The second line in the assertion
is simply the second line in (6.3). □
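The last step of this proof can be spelled out as follows. This is a short worked version, not taken from the source; it uses only that $j^2 \ge j$ for $j \ge 1$, and the abbreviation $x = e^{-n\varepsilon^2/4} \in (0,1)$ is introduced here for illustration.

```latex
\begin{align*}
\sum_{j=1}^{\infty} D(j\varepsilon)\, e^{-n j^2 \varepsilon^2/4}
  &\le D(\varepsilon) \sum_{j=1}^{\infty} e^{-n j^2 \varepsilon^2/4}
      && \text{($D$ nonincreasing)} \\
  &\le D(\varepsilon) \sum_{j=1}^{\infty} x^{j}
      && \text{($j^2 \ge j$, $x = e^{-n\varepsilon^2/4}$)} \\
  &= D(\varepsilon)\, \frac{x}{1-x}
   = D(\varepsilon)\, \frac{e^{-n\varepsilon^2/4}}{1 - e^{-n\varepsilon^2/4}} .
\end{align*}
```

The right-hand side is the first bound in (6.6).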

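7. Proofs of the main theorems. The following lemma is analogous to
Lemma 8.1 in [8] and can be proved in the same manner.

Lemma 7.1. For given $\varepsilon > 0$ and $P^* \in \mathcal{P}$ such that $P_0\log(p_0/p^*) < \infty$,
define $B(\varepsilon,P^*;P_0)$ by (2.3). Then for every $C > 0$ and probability measure $\Pi$
on $\mathcal{P}$,

$$
P_0^n\biggl( \int \prod_{i=1}^{n} \frac{p}{p^*}(X_i)\,d\Pi(P) \le \Pi(B(\varepsilon,P^*;P_0))\, e^{-n\varepsilon^2(1+C)} \biggr) \le \frac{1}{C^2 n\varepsilon^2}.
$$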

Proof of Theorem 2.2. In view of (2.5), the conditions of Theorem 6.3
are satisfied, with the function $D(\varepsilon) = e^{n\varepsilon_n^2}$ (i.e., constant in $\varepsilon > \varepsilon_n$).
Let $\phi_n$ be the test as in the assertion of this theorem for $\varepsilon = M\varepsilon_n$ and $M$ a
large constant, to be determined later in the proof.

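For $C > 0$, also to be determined later in the proof, let $\Omega_n$ be the event

$$
\int \prod_{i=1}^{n} \frac{p}{p^*}(X_i)\,d\Pi(P) \ge e^{-(1+C)n\varepsilon_n^2}\, \Pi(B(\varepsilon_n,P^*;P_0)).
\tag{7.1}
$$

Then $P_0^n(\mathcal{X}^n \setminus \Omega_n) \le 1/(C^2 n\varepsilon_n^2)$, by Lemma 7.1.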

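Set $\Pi_n(\varepsilon) = \Pi_n(P \in \mathcal{P} : d(P,P^*) > \varepsilon \mid X_1,\ldots,X_n)$. For every $n \ge 1$ and
$J \in \mathbb{N}$, we can decompose

$$
\begin{aligned}
P_0^n\Pi_n(JM\varepsilon_n) &= P_0^n\bigl(\Pi_n(JM\varepsilon_n)\phi_n\bigr) + P_0^n\bigl(\Pi_n(JM\varepsilon_n)(1-\phi_n)1_{\Omega_n^c}\bigr)\\
&\quad + P_0^n\bigl(\Pi_n(JM\varepsilon_n)(1-\phi_n)1_{\Omega_n}\bigr).
\end{aligned}
\tag{7.2}
$$

We estimate the three terms on the right-hand side separately. Because
$\Pi_n(\varepsilon) \le 1$, the middle term is bounded by $1/(C^2 n\varepsilon_n^2)$. This converges to zero
as $n\varepsilon_n^2 \to \infty$ for fixed $C$ and/or can be made arbitrarily small by choosing
a large constant $C$ if $n\varepsilon_n^2$ is bounded away from zero.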

By the first inequality in (6.6), the first term on the right-hand side of (7.2)
is bounded by

$$
P_0^n\bigl(\Pi_n(JM\varepsilon_n)\phi_n\bigr) \le P_0^n\phi_n \le \frac{e^{(1-M^2/4)n\varepsilon_n^2}}{1 - e^{-nM^2\varepsilon_n^2/4}}.
$$

For sufficiently large $M$, the expression on the right-hand side is bounded
above by $2e^{-n\varepsilon_n^2M^2/8}$ for sufficiently large $n$ and, hence, can be made arbitrarily
small by choice of $M$, or converges to 0 for fixed $M$ if $n\varepsilon_n^2 \to \infty$.

B. J. K. KLEIJN AND A. W. VAN DER VAART 



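Estimation of the third term on the right-hand side of (7.2) is more involved.
Because $P_0(p^* > 0) = 1$, we can write

$$
P_0^n\bigl(\Pi_n(JM\varepsilon_n)(1-\phi_n)1_{\Omega_n}\bigr)
= P_0^n\biggl[ (1-\phi_n)1_{\Omega_n}\, \frac{\int_{d(P,P^*)>JM\varepsilon_n} \prod_{i=1}^{n}(p/p^*)(X_i)\,d\Pi(P)}{\int \prod_{i=1}^{n}(p/p^*)(X_i)\,d\Pi(P)} \biggr],
\tag{7.3}
$$

where we have written the arguments $X_i$ for clarity. By the definition of $\Omega_n$,
the integral in the denominator is bounded below by $e^{-(1+C)n\varepsilon_n^2}\,\Pi(B(\varepsilon_n,P^*;P_0))$.
Inserting this bound, writing $Q(P)$ for the measure defined by $dQ(P) =
(p_0/p^*)\,dP$, and using Fubini's theorem, we can bound the right-hand side
of the preceding display by

$$
\frac{e^{(1+C)n\varepsilon_n^2}}{\Pi(B(\varepsilon_n,P^*;P_0))} \int_{d(P,P^*)>JM\varepsilon_n} Q(P)^n(1-\phi_n)\,d\Pi(P).
\tag{7.4}
$$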

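Setting $\mathcal{P}_{n,j} = \{P \in \mathcal{P} : M\varepsilon_n j < d(P,P^*) \le M\varepsilon_n(j+1)\}$, we can decompose
$\{P : d(P,P^*) > JM\varepsilon_n\} = \bigcup_{j\ge J} \mathcal{P}_{n,j}$. The tests $\phi_n$ have been chosen to
satisfy the inequality $Q(P)^n(1-\phi_n) \le e^{-nj^2M^2\varepsilon_n^2/4}$ uniformly in $P \in \mathcal{P}_{n,j}$.
[Cf. the second inequality in (6.6).] It follows that the preceding display is
bounded by

$$
\frac{e^{(1+C)n\varepsilon_n^2}}{\Pi(B(\varepsilon_n,P^*;P_0))} \sum_{j\ge J} e^{-nj^2M^2\varepsilon_n^2/4}\, \Pi(\mathcal{P}_{n,j})
\le \sum_{j\ge J} e^{(1+C)n\varepsilon_n^2 + n\varepsilon_n^2M^2j^2/8 - nj^2M^2\varepsilon_n^2/4},
$$

by (2.11). For fixed $C$ and sufficiently large $M$, this converges to zero if $n\varepsilon_n^2$
is bounded away from zero and $J = J_n \to \infty$. □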

Proof of Theorem 2.1. Because $\Pi$ is a probability measure, the numerator
in (2.11) is bounded above by 1. Therefore, the prior mass condition
(2.11) is implied (for large $j$) by the prior mass condition (2.4). We conclude
that the assertion of Theorem 2.1, but with $M = M_n \to \infty$, follows from
Theorem 2.2. That in fact it suffices that $M$ is sufficiently large follows by
inspection of the preceding proof. □

Proof of Theorem 2.4. The proof of this theorem follows the same
steps as the preceding proofs. A difference is that we cannot appeal to the
preparatory lemmas and theorems to split the proof in separate steps. The
shells $\mathcal{P}_{n,j} = \{P \in \mathcal{P} : Mj\varepsilon_n < d(P,\mathcal{P}^*) \le M(j+1)\varepsilon_n\}$ must be covered
by sets $B_{n,j,i}$ as in the definition (2.16), and for each such set we use the
appropriate element $P^*_{n,j,i} \in \mathcal{P}^*$ to define a test $\phi_{n,j,i}$ and to rewrite the
left-hand side of (7.3). We omit the details. □

8. Technical lemmas. Lemma 8.1 is used to upper bound the Kullback-Leibler
divergence and the expectation of the squared logarithm by a function
of the $L_1$-norm. A similar lemma was presented in [16], where both $p$
and $q$ were assumed to be densities of probability distributions. We generalize
this result to the case where $q$ is a finite measure and we are forced to
use the $L_1$ instead of the Hellinger distance.

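Lemma 8.1. For every $b > 0$, there exists a constant $\varepsilon_b > 0$ such that,
for every probability measure $P$ and finite measure $Q$ with $0 < h^2(p,q) <
\varepsilon_b P(p/q)^b$,

$$
\begin{aligned}
P\log\frac pq &\lesssim h^2(p,q)\Bigl(1 + \frac1b \log_+\frac{1}{h(p,q)} + \frac1b \log_+ P\Bigl(\frac pq\Bigr)^b\Bigr) + \|p-q\|_1,\\
P\Bigl(\log\frac pq\Bigr)^2 &\lesssim h^2(p,q)\Bigl(1 + \frac1b \log_+\frac{1}{h(p,q)} + \frac1b \log_+ P\Bigl(\frac pq\Bigr)^b\Bigr)^2.
\end{aligned}
$$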

Proof. The function $r : (0,\infty) \to \mathbb{R}$ defined implicitly by $\log x = 2(\sqrt{x} -
1) - r(x)(\sqrt{x} - 1)^2$ possesses the following properties:

• $r$ is nonnegative and decreasing.

• $r(x) \sim \log(1/x)$ as $x \downarrow 0$, whence there exists $\varepsilon' > 0$ such that $r(x) \le
2\log(1/x)$ on $[0,\varepsilon']$. (A computer graph indicates that $\varepsilon' = 0.4$ will do.)

• For every $b > 0$, there exists $\varepsilon''_b > 0$ such that $x^b r(x)$ is increasing on $[0,\varepsilon''_b]$.
(For $b \ge 1$, we may take $\varepsilon''_b = 1$, but for $b$ close to zero, $\varepsilon''_b$ must be very
small.)
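The three properties of $r$ can be inspected numerically, in the spirit of the computer graph mentioned above. The sketch below is illustrative only: it solves the defining identity for $r(x)$ explicitly, sets $r(1) = 1$ by continuity (the limiting value from the expansion of $\log(1+u)$), and evaluates $r$ on a geometric grid chosen here for illustration. A grid check of this kind is a sanity check, not a proof.

```python
import math


def r(x):
    """Solve log x = 2(sqrt(x) - 1) - r(x) (sqrt(x) - 1)^2 for r(x), x > 0."""
    u = math.sqrt(x) - 1.0
    if u == 0.0:
        return 1.0  # limiting value at x = 1
    # log x = 2 log(1 + u); log1p keeps the numerator accurate near x = 1.
    return (2.0 * u - 2.0 * math.log1p(u)) / (u * u)


# Geometric grid from 1e-8 to 1e8, ten points per decade.
grid = [10.0 ** (k / 10.0) for k in range(-80, 81)]
values = [r(x) for x in grid]

nonnegative = all(v >= 0.0 for v in values)
decreasing = all(a >= b for a, b in zip(values, values[1:]))
log_bound = all(r(x) <= 2.0 * math.log(1.0 / x) for x in grid if x <= 0.4)

# Third property for b = 1: x * r(x) is increasing on the grid points in (0, 1].
scaled = [x * r(x) for x in grid if x <= 1.0]
increasing_b1 = all(a <= b for a, b in zip(scaled, scaled[1:]))

if __name__ == "__main__":
    print(nonnegative, decreasing, log_bound, increasing_b1)
```

The threshold 0.4 in the second property leaves some margin: on this grid the bound $r(x) \le 2\log(1/x)$ still holds at $x = 0.5$.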

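In view of the definition of $r$ and the first property, we can write

$$
\begin{aligned}
P\log\frac pq &= -2P\Bigl(\sqrt{\frac qp} - 1\Bigr) + P\,r\Bigl(\frac qp\Bigr)\Bigl(\sqrt{\frac qp} - 1\Bigr)^2\\
&\le h^2(p,q) + 1 - \int q\,d\mu + P\,r\Bigl(\frac qp\Bigr)\Bigl(\sqrt{\frac qp} - 1\Bigr)^2\\
&\le h^2(p,q) + \|p-q\|_1 + r(\varepsilon)h^2(p,q) + P\,r\Bigl(\frac qp\Bigr) 1\Bigl\{\frac qp \le \varepsilon\Bigr\}
\end{aligned}
$$

for any $0 < \varepsilon \le 4$, where we use the fact that $|\sqrt{q/p} - 1| \le 1$ if $q/p \le 4$. Next
we choose $\varepsilon \le \varepsilon''_b$ and use the third property to bound the last term on the
right-hand side by $P(p/q)^b \varepsilon^b r(\varepsilon)$. Combining the resulting bound with the
second property, we then obtain, for $\varepsilon \le \varepsilon' \wedge \varepsilon''_b \wedge 4$,

$$
P\log\frac pq \le h^2(p,q) + \|p-q\|_1 + 2\log\frac1\varepsilon\, h^2(p,q) + 2\varepsilon^b \log\frac1\varepsilon\, P\Bigl(\frac pq\Bigr)^b.
$$

For $\varepsilon^b = h^2(p,q)/P(p/q)^b$, the last two terms on the right-hand
side take the same form. If $h^2(p,q) \le \varepsilon_b P(p/q)^b$ for a sufficiently small $\varepsilon_b$,
then this choice is eligible and the first inequality of the lemma follows.
Specifically, we can choose $\varepsilon_b \le (\varepsilon' \wedge \varepsilon''_b \wedge 4)^b$.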

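To prove the second inequality, we first note that, since $|\log x| \le 2|\sqrt{x} - 1|$
for $x \ge 1$,

$$
P\Bigl(\log\frac pq\Bigr)^2 1\{q \ge p\} \le 4P\Bigl(\sqrt{\frac qp} - 1\Bigr)^2 \le 4h^2(p,q).
$$

Next, with $r$ as in the first part of the proof,

$$
\begin{aligned}
P\Bigl(\log\frac pq\Bigr)^2 1\{q < p\} &\le 8P\Bigl(\sqrt{\frac qp} - 1\Bigr)^2 + 2P\,r^2\Bigl(\frac qp\Bigr)\Bigl(\sqrt{\frac qp} - 1\Bigr)^4 1\{q < p\}\\
&\le 8h^2(p,q) + 2r^2(\varepsilon)h^2(p,q) + 2\varepsilon^b r^2(\varepsilon) P\Bigl(\frac pq\Bigr)^b,
\end{aligned}
$$

for $\varepsilon \le \varepsilon''_{b/2}$, in view of the third property of $r$. (The power of 4 in the first
line of the array can be lowered to 2 or 0, as $|\sqrt{q/p} - 1| \le 1$.) We can use the
second property of $r$ to bound $r(\varepsilon)$ and next choose $\varepsilon^b = h^2(p,q)/P(p/q)^b$
to finish the proof. Specifically, we can choose $\varepsilon_b \le (\varepsilon' \wedge \varepsilon''_{b/2})^b$. □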

Lemma 8.2. If $p, p_n, p_\infty$ are probability densities in $L_1(\mu)$ such that
$p_n \to p_\infty$ as $n \to \infty$, then $\liminf_{n\to\infty} P\log(p/p_n) \ge P\log(p/p_\infty)$.

Proof. If $X_n = p_n/p$ and $X_\infty = p_\infty/p$, then $X_n \to X_\infty$ in $P$-probability
and in mean. We can write $P\log(p_n/p)$ as the sum of $P(\log X_n)1_{X_n>1}$ and
$P(\log X_n)1_{X_n\le 1}$. Because $0 \le (\log x)1_{x>1} \le x$, the sequence $(\log X_n)1_{X_n>1}$
is dominated in absolute value by the sequence $|X_n|$, and, hence, is uniformly
integrable. By a suitable version of the dominated convergence theorem,
we have $P(\log X_n)1_{X_n>1} \to P(\log X_\infty)1_{X_\infty>1}$. Because the variables
$(\log X_n)1_{X_n\le 1}$ are nonpositive, we can apply Fatou's lemma to see that
$\limsup P(\log X_n)1_{X_n\le 1} \le P(\log X_\infty)1_{X_\infty\le 1}$. □